
Virtual Vector Method Data Cache


MitchAlsup
Feb 28, 2020, 5:52:34 PM
As I have discussed here before, I have invented a means to process
instructions quickly on not-very-wide machines called the "Virtual Vector
Method". Today's topic concerns some "other" stuff one may want to do when
processing Vector Loops that would be unavailable when one is processing
scalar instructions (even several at a time.)

Vector Processing is KNOWN to do "bad" things to data caches, generally strip-
mining one or more vectors through the cache, leaving no footprint of other
data which may be of higher importance after the vector calculations have
completed. Today's topic should alleviate most of this degradation.

Vector loops come in two (2) flavors: counted loops, and conditional loops.
Counted loops are of the classical form:

for( i = 0; i < MAX; i++ )

Whereas Conditional loops are of the form:

for( i = 0; p[i]; i++ )

In the counted loop version, it would be easy for the processor to examine
the MAX variable at Loop-setup time and determine if the loop was "long"
(say more trips than "page" size) or not "long". Should the loop be long,
Loads and Stores can be processed with modified semantics.

In particular, Loads which take a miss in the data cache can be brought into
some memory buffer and forwarded into calculation from there without ever
being written into the data cache and without incurring any cycles of additional latency that would not have already occurred (and probably
saving some cycles.)

Similarly, Stores which take a miss in the Data Cache, where it is KNOWN
that every byte of the cache line will be written, can collect their
data and Write-Invalidate it beyond the Data Cache without incurring any
cycles of additional latency; indeed, this eliminates the requirement
that Stores which miss the Data Cache Read the data into the cache and
obtain Write-Permission (E or M state) prior to storing new data.

Now, taken together, consider the following loop::

{
char from[*], to[*];
for( i = 0; i < count; i++ )
to[i] = from[i];
}

A) this is a counted loop
B) in this case, to[*] and from[*] are page-aligned data
C) VVM is implemented.

Hardware notices that the Loaded data is forwarded to Store Data without
ever being used, and thus, the from[*] data is not needed in the Data Cache.

Hardware also notices that the Data is cache line aligned.

In the first loop iteration, from[0] takes a miss in the data cache. A
request is formulated to the outer memory hierarchy to obtain the data
with Read-Only requested line state. to[0] takes a miss in the data
cache and it can be seen that the entire cache line will be written,
so NO REQUEST is made to the outer memory hierarchy; a Cache Miss
buffer is allocated to hold the Store Data.

When from[0..15] arrives it is copied to to[0..15] and i is incremented
from 0 to 16, and the i=16 loop iteration begins. From this point on, every
time a from[k..k+15] chunk arrives it is copied directly to to[k..k+15]
and the loop index is incremented by the number of bytes transferred. {The
L2 cache needs to be able to sustain at least 1/2 of this kind of BW
requirement}

In the VVM version of My 66000 ISA the above loop is written:

VEC 0
LDB R5,[R1+R4]
STB R5,[R2+R4]
LOOP LT,R4,#1,R3

And this loop can run at the equivalent of 5 instructions per iteration
TIMES the number of iterations that are performed per cycle (16)::
{hint: LOOP counts as 3 instructions: ADD..CMP..BLT} So we have an execution
rate of 80 IPC. {If the loop were not cache-line aligned the code
can only run at 40 IPC.} Not bad for a 1-wide In-Order (IO) processor! However,
the "small" My 66100 implementation only has a cache access width of
128 bits, so it can only run at 40 IPC aligned or 20 IPC misaligned.

At this point, no SW writer ever has to consider how to optimize
page-aligned-page-copy:: the byte copy loop works as well as anything
and everything else.

Also: no SW writer has to consider how to optimize byte-misaligned copy
as he will have the utmost difficulty in matching the performance of
the Vectorized byte copy.

And no unit of SW had to use those funny SSE/AVX registers to move
data around, with all the boundary effects these entail.

Either Terje or EricP noted that there might be as much as 6% of cycles
available from optimizing page-aligned-page-copies. I think the above
captures most of that 6%. Given that the Load side does not pollute the
Data cache maybe a bit more.

Given that Loads (which take a miss in the Data Cache) are processed as
streaming loads, and that Stores that take a miss in the Data Cache
AND WILL write every byte of the cache line can be processed automagically
as streaming stores, the up-front latency to obtain write permission is
eliminated.

Thus, the Data Cache suffers minimal eviction of useful data when processing
long Vectors. About all that is necessary is the ability to "step
forward" from cache line to cache line in order to keep the L2 cache
filled with streaming requests.

One does this by NOT adding the increment variable in counted loops!
Instead, one examines the low order address bits (after AGEN) and
decides how many loop increments it will take to cross over into
the next cache line. The My 66100 microArchitecture has a 128-bit
wide path to the Data Cache. Thus, a Vector Load-Vector Store pair
will take 2 cycles to move a cache line of data from the from[*] side
to the to[*] side. This means that the AGEN unit has a cycle to
step to the next cache line for the Load and for the Store on each
<multi> loop iteration.
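
A minimal sketch of that address arithmetic in C (my reading; the 64-byte
line size and the helper name are assumptions, not anything from My 66000):

#include <stdint.h>

/* How many 1-byte iterations remain before the access at 'addr' crosses
   into the next cache line; the LOOP increment for one multi-iteration
   would be the smaller of this and the cache-port width. */
static inline unsigned iters_to_line_end(uint64_t addr)
{
    const unsigned line_size = 64;                /* assumed line size */
    return line_size - (unsigned)(addr & (line_size - 1));
}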

By "understanding" the counted loop but processing the loop at cache
port size and incrementing by "bytes consumed" the processor is
operating as-if it were 4-way or 8-way SuperScalar while being 1-wide!

For conditional loops, the processor can assume that the loop is going
to have enough iterations to finish the current cache line (each time)
and then take two (2) remediation steps:: a) the conditional operation
is performed across all of the data-units being transferred from from[*]
to to[*]; b) if the loop needs to be terminated, the processor will
go fetch the last Store cache line(s) with Read-with-Intent-to-Modify
command and then merge its Store buffer with the newly arriving data
before Sending the cache line back into the memory hierarchy (or into
the data cache).

I am not exactly sure how to figure out that short string copies are
detected so they end up in the data cache while long string copies
avoid damaging the data cache. One thought train is that if the string
ends up smaller than 1 (or 2) cache lines in length, it goes into
the data cache and if it is longer, it goes into the memory hierarchy.
This only requires another store buffer entry.

Because the loops all begin with the VEC instruction and all end with
the LOOP instruction, the branch prediction tables are not modified while
Vector Looping! This should allow the branch predictor to devote its
resources to predicting hard-to-predict branches, as the easy-to-
predict branches (i.e., Loops) have been removed from consideration.

A side-line consideration to VVM::
Any taken branch inside a VEC..LOOP construct terminates the Vector loop.
Thus, if you want to perform conditionality within a vectorized loop you
will have to use the predicate instructions available in the ISA. So,
during compiler development, consideration is directed towards predication
before being directed at vectorization.

I have now formalized the VEC and LOOP instructions in the My 66000 ISA.
memmove, strcpy, and strncpy are all vectorizable (and all of the above
optimizations apply).

So:

A) we get rid of the need for SW to devote time to optimizing memory-
to-memory moves
B) we enable LBIO processors to perform small simple loops as if they
were GBOoO processors.
C) we avoid strip-mining data out of the data cache when performing
Vector calculations.
D) we can optimally vectorize the whole C string library
E) compilers automatically get optimal struct-copy
F) and we don't modify the branch prediction tables while LOOPing !

Stephen Fuld
Feb 29, 2020, 12:14:22 PM
On 2/28/2020 2:52 PM, MitchAlsup wrote:
> As I have discussed here before, I have invented a means to process
> instructions quickly on not-very-wide machines called the "Virtual Vector
> Method". Today's topic concerns some "other" stuff one may want to do when
> processing Vector Loops that would be unavailable when one is processing
> scalar instructions (even several at a time.)


As you know, I like VVM a lot, especially if you add the optional
"intermediate value save area" we discussed here a few weeks ago to
allow reductions to run at full precision and be spread over multiple
functional units.

I need to spend more time thinking about your proposal here, but my
initial reaction is that for memmove, etc. you are hard pressed to beat,
or even to match, a loop that moves a larger number of bytes per
iteration. Your code moves one byte, but moving a double word gains a
factor of 8 by itself. Even more gain could be gotten by using the
special 4 register loads and store instructions.
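
For reference, the kind of wider-per-iteration loop being compared against,
as a plain C sketch (assuming the byte count is a multiple of 8 and both
pointers are 8-byte aligned):

#include <stddef.h>
#include <stdint.h>

void copy_dw(uint64_t *to, const uint64_t *from, size_t count)  /* count in bytes */
{
    for (size_t i = 0; i < count / 8; i++)        /* 8 bytes per trip */
        to[i] = from[i];
}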

While I think most of the "tricks" you outlined could also be applied to
a "wider" version, given the frequency of these operations, I think a
better solution is the dedicated single instruction that can incorporate
them all in hardware where appropriate, and take care of the non-aligned
cases, short cases, bypassing cache, etc. automagically. That is, it
could incorporate all the things you talked about within the single
instruction. Of course, that would also save a little I cache
bandwidth, power for decode, etc.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

EricP
Feb 29, 2020, 12:31:22 PM
MitchAlsup wrote:
>
> Either Terje or EricP noted that there might be as much as 6% of cycles

Twasn't me.


MitchAlsup
Feb 29, 2020, 1:06:55 PM
On Saturday, February 29, 2020 at 9:14:22 AM UTC-8, Stephen Fuld wrote:
> On 2/28/2020 2:52 PM, MitchAlsup wrote:
> > As I have discussed here before, I have invented a means to process
> > instructions quickly on not-very-wide machines called the "Virtual Vector
> > Method". Today's topic concerns some "other" stuff one may want to do when
> > processing Vector Loops that would be unavailable when one is processing
> > scalar instructions (even several at a time.)
>
>
> As you know, I like VVM a lot, especially if you add the optional
> "intermediate value save area" we discussed here a few weeks ago to
> allow reductions to run at full precision and be spread over multiple
> functional units.
>
> I need to spend more time thinking about your proposal here, but my
> initial reaction is that for memmove, etc. you are hard pressed to beat,
> or even to match, a loop that moves a larger number of bytes per
> iteration. Your code moves one byte, but moving a double word gains a
> factor of 8 by itself. Even more gain could be gotten by using the
> special 4 register loads and store instructions.

This is the (THE) HW that enables code written as 1-wide (as in 1-byte
per iteration) to be "performed" as if it were 8 or 16 bytes wide.
And it can do this even without having a SW visible register that is
16-bytes wide!

My written code is a template that moves 1-byte per iteration.
The HW Data Cache, then, looks at how many bytes it could move
and moves that many instead, passing how many it moved on to the
LOOP instruction so that the increment is performed correctly.

The code is (IS) written 1-byte at a time. The HW recognizes that
each iteration moves one more byte and that additional byte is
adjacent to the first byte. So, the HW is now in a position to
move as many bytes in a row as whatever cache-line boundary is
going to limit the iteration count. It then moves as many bytes
as it can and then uses that byte count as the increment.

1-byte at a time loops running 8-16 loop iterations per cycle.
>
> While I think most of the "tricks" you outlined could also be applied to
> a "wider" version, given the frequency of these operations, I think a
> better solution is the dedicated single instruction that can incorporate
> them all in hardware where appropriate, and take care of the non-aligned
> cases, short cases, bypassing cache, etc. automagically. That is, it
> could incorporate all the things you talked about within the single
> instruction. Of course, that would also save a little I cache
> bandwidth, power for decode, etc.

The same kind of data flow occurs in lots of number crunching applications.
Take FFT, for example: here we have 4 loads feeding 10 FP calculations
which then feed 4 stores followed by ADD/CMP/BLT

The first 2 loads bring out the real and imaginary part of one array
partition, the second 2 do the same on the "other" array partition.
The HW can easily recognize that a single cache line contains the data
for 4 successive load-pairs and not access the Cache, its tags or the TLB
while still getting the requisite data.

This HW also recognizes that the Stores are using the same cache lines
as the Loads so there is no reason to "read" these lines "up". The HW
can cause the Loads (which miss) to be performed with "intent to modify"
so write permission arrives with the load and you don't have to go get
"it" by performing the Store.

Most of this (i.e., FFT) is beneficial on the power end of things as the
FP function units will have enough latency and are "thin" enough that
the calculations are more of a bottleneck than the memory performance.



Thus, this HW is appropriate for all sorts of high BW data flowing through
a machine. However, this is (IS) the piece that makes it possible to run
multiple iterations of a loop per cycle. And that is why the functionality
was not distilled into 1-7 instructions.

Anton Ertl
Feb 29, 2020, 1:25:48 PM
MitchAlsup <Mitch...@aol.com> writes:
>I am not exactly sure how to figure out that short string copies are
>detected so they end up in the data cache while long string copies
>avoid damaging the data cache. One thought train is that if the string
>ends up smaller than 1 (or 2) cache lines in length, it goes into
>the data cache and if it is longer, it goes into the memory hierarchy.
>This only requires another store buffer entry.

My intuition suggests that caching would be profitable also for
significantly longer strings, but it is best to determine the switching point
empirically. Also, you may have one switching point for keeping the
stuff in L1, a different one for L2, and another for L3.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Stephen Fuld
Mar 1, 2020, 11:56:55 AM
OK, I missed that in your first post. Very neat!

Some questions. What happens if the source and/or the destination are
not aligned? How do you handle what would otherwise be a big shift type
operation?

And is the hardware smart enough that if you code the test for a zero
byte after the store (for string copy) to convert that test to check
multiple bytes in one cycle?

snip


> Thus, this HW is appropriate for all sorts of high BW data flowing through
> a machine. However, this is (IS) the piece that makes it possible to run
> multiple iterations of a loop per cycle. And that is why the functionality
> was not distilled into 1-7 instructions.


I understand the motivation. The tricky part is getting all the corner
cases right. I guess that is true in general :-)

MitchAlsup
Mar 1, 2020, 12:26:55 PM
When from[*] and to[*] are misaligned with respect to each other (at
cache-line boundaries) it simply takes 2 cycles per half cache line
rather than 1. The first cycle stages all the data it can from from[*]
and places it in to[*]. In the second cycle it advances to the next
to[*] line so the new to[*] can accept the rest of the data from
from[*]. {Either to[*] or from[*] can run out first.}
>
> And is the hardware smart enough that if you code the test for a zero
> byte after the store (for string copy) to convert that test to check
> multiple bytes in one cycle?

Thankfully, checking for multiple zeroes is not very difficult.

Terje Mathisen
Mar 1, 2020, 12:50:58 PM
I appreciate getting credit but I don't remember stating that. :-)

I did note that any relatively aligned strcpy() style operation
should run at whatever speed the memory hierarchy can support; with
misaligned data the fastest safe solution is probably to use aligned
loads, and misaligned stores only for full SIMD blocks, special-casing
the last/partial block.

It is in fact possible that some systems would run the fastest by using
aligned loads to detect the terminator and then reloading the data with
a misaligned access so that the store will be aligned.

This depends on the relative speed of a misaligned load from $L1
compared to an internal byte shuffle to prepare for storing.

Classic Altivec had dual-source/single destination permute which meant
that it would be trivially as fast or faster than reloading, while x86
is missing this so the dual-source shuffle would need at least three
instructions.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Anton Ertl
Mar 1, 2020, 1:40:38 PM
Terje Mathisen <terje.m...@tmsw.no> writes:
>I did note that that any relatively aligned strcpy() style operation
>should run at whatever speed the memory hierarchy can support, with
>misaligned data the fastest safe solution is probably to use aligned
>loads, and misaligned stores only for full SIMD blocks, specialcasing
>the last/partial block.

Why aligned loads and misaligned stores? My thoughts go for
misaligned loads and aligned stores (special handling for first/last
block), e.g.,
<https://github.com/AntonErtl/move/blob/master/avxmemcpy.c>. The
reasons are:

* A misaligned load just becomes two aligned loads plus some shifting
and merging in hardware, and many CPUs (e.g., Skylake) support more
loads than stores per cycle.

* A misaligned store may become two loads, some shifting and merging,
and two stores in the hardware. Or are store buffers so good that
they avoid all that, and they convert a sequence of adjacent
misaligned stores into a sequence of aligned stores?
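
A software sketch of the shift-and-merge being described (little-endian,
8-byte words assumed; illustrative only, not how any particular core
implements it):

#include <stdint.h>
#include <string.h>

static uint64_t load_misaligned(const uint8_t *p)
{
    const uint8_t *q = (const uint8_t *)((uintptr_t)p & ~(uintptr_t)7);
    unsigned off = (unsigned)((uintptr_t)p & 7) * 8;
    uint64_t lo, hi;
    memcpy(&lo, q, 8);        /* two aligned loads */
    memcpy(&hi, q + 8, 8);    /* note: touches the next word even when not
                                 strictly needed -- the page-crossing hazard
                                 Terje raises in his reply */
    return off ? (lo >> off) | (hi << (64 - off)) : lo;
}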

MitchAlsup
Mar 1, 2020, 3:06:04 PM
On Sunday, March 1, 2020 at 9:50:58 AM UTC-8, Terje Mathisen wrote:
> EricP wrote:
> > MitchAlsup wrote:
> >>
> >> Either Terje or EricP noted that there might be as much as 6% of cycles
> >
> > Twasn't me.

This leaves me wondering where the credit should be directed.

>
> I appreciate getting credit but I don't remember stating that. :-)
>
> I did note that that any relatively aligned strcpy() style operation
> should run at whatever speed the memory hierarchy can support, with
> misaligned data the fastest safe solution is probably to use aligned
> loads, and misaligned stores only for full SIMD blocks, specialcasing
> the last/partial block.
>
> It is in fact possible that some systems would run the fastest by using
> aligned loads to detect the terminator and then reloading the data with
> a misaligned access so that the store will be aligned.

Once working in HW it really doesn't matter which end you try to align.
My general plan was to load the 1/2 cache line into a buffer and
move as many bytes to the store buffer as can be done from that source.
This aligns the rest of the loads. Then fill another inbound buffer
and transfer enough to fill the outbound buffer. Then "allocate" another
outbound buffer and repeat.
>
> This depends on the relative speed of a misaligned load from $L1
> compared to an internal byte shuffle to prepare for storing.

Once you know the data is dense in memory, you recognize that all bytes
are shifted the same amount. So once set up, you just keep pumping data
through the network.
>
> Classic Altivec had dual-source/single destination permute which meant
> that it would be trivially as fast or faster than reloading, while x86
> is missing this so the dual-source shuffle would need at least three
> instructions.

The IBM 360 had double-wide-operand, single-wide-result shifts to do
similarly, some 30 years earlier.

Kent Dickey
Mar 1, 2020, 6:05:43 PM
In article <0c38fe7a-8814-4e52...@googlegroups.com>,
MitchAlsup <Mitch...@aol.com> wrote:
>As I have discussed here before, I have invented a means to process
>instructions quickly on not-very-wide machines called the "Virtual Vector
>Method". Today's topic concerns some "other" stuff one may want to do when
>processing Vector Loops that would be unavailable when one is processing
>scalar instructions (even several at a time.)

[ snip ]

>In the VVM version of My 66000 ISA the above loop is written:
>
> VEC 0
> LDB R5,[R1+R4]
> STB R5,[R2+R4]
> LOOP LT,R4,#1,R3

I've wondered about the need for the "VEC" instruction. I think the
hardware can figure out which registers are scalar or vector by the end of the
2nd iteration.

Imagine having a 32-bit hidden VEC register which the hardware
manages (roughly creating the state of the VEC instruction, where in the VEC
regsiter each bit refers to R0-R31, and '1' means that register is scalar,
and '0' means it's a vector register). And there's a 2-bit VEC_STATUS
register indicating state. We need another VEC_RD 32-bit register. Normal
machine operation is VEC_STATUS=0 (and exiting a LOOP sets VEC_STATUS=0,
VEC_RD=0, VEC=0).

Detecting a scalar register means detecting a register which is first read
in the loop, and then later written. Other combinations are not scalar.

Upon seeing a taken LOOP instruction while VEC_STATUS=0, set VEC_STATUS=1,
meaning learn. Now each register that is accessed is classified as scalar
or not by whether it lives across iterations. Set
VEC_RD[x]=1 if register x is read from. If register x is written to, set
VEC[x]=1 if VEC_RD[x]=1. When LOOP is seen again and taken, VEC now knows
which registers are scalars, so VEC_STATUS=2 and HW enters the usual VVM mode,
using VEC to inform the execution units appropriately as each instruction
is fetched, and then the vector mode takes over.
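
A rough software model of that learning rule, if I read it right (the bit
conventions are Kent's; the struct and function names are made up for
illustration):

#include <stdint.h>

typedef struct {
    uint32_t vec;      /* bit set => register classified as scalar     */
    uint32_t vec_rd;   /* bit set => register was read this iteration  */
    int      status;   /* 0 = normal, 1 = learn, 2 = vectorized        */
} vec_learn_t;

/* Called once per register access during the learning iteration. */
static void vec_observe(vec_learn_t *s, unsigned reg, int is_write)
{
    if (!is_write)
        s->vec_rd |= 1u << reg;            /* read seen                       */
    else if (s->vec_rd & (1u << reg))
        s->vec |= 1u << reg;               /* read then written => scalar bit */
}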

Why slow things down (since these first 2 iterations won't be as fast as
the usual VVM)? First, to avoid the software problem: if you ask
software to do special notation for loops (e.g., put in the VEC instruction),
it will generally make poor choices. Either unimportant loops will be
vectorized, or important loops will be missed. Second, by leaving it to
hardware, you can enhance the feature in the future without needing
software changes. As I envision it, software would ALWAYS use the LOOP
instruction for any loop, even ones it knew would be short, and just not
care since the hardware just does the right thing in all cases.

And third, a future implementation could allow nested LOOPs on GBOoO
implementations, which would be even more powerful. No software changes
would be needed, as long as software just made most/all loops use LOOP, even
nested ones. Hardware can nest LOOP as deep as it feels it can support,
dynamically.

And fourth, if you add LOOP to a LOOP prediction table, you can get these
cases right for future calls, even nested cases. Like a branch lookup to
get the target, you detect the start of a loop by remembering it, and pull
out the VEC register information, and be vectorized from the first iteration
(as long as you've seen it before). This table could be pretty small,
since it would just contain loops that lasted long enough to really take
advantage of the vectorized loops.

Taking this idea to the extreme, you technically don't need the LOOP
instruction either. But detecting loops to try to vectorize is much
harder without it, and the LOOP operation is a very useful one to have.

Kent

Ivan Godard
Mar 1, 2020, 6:54:36 PM
That is a very interesting idea. Is it de novo with you, or based on
other work you could cite for us?

Stephen Fuld
Mar 1, 2020, 7:07:43 PM
On 3/1/2020 3:05 PM, Kent Dickey wrote:
> In article <0c38fe7a-8814-4e52...@googlegroups.com>,
> MitchAlsup <Mitch...@aol.com> wrote:
>> As I have discussed here before, I have invented a means to process
>> instructions quickly on not-very-wide machines called the "Virtual Vector
>> Method". Today's topic concerns some "other" stuff one may want to do when
>> processing Vector Loops that would be unavailable when one is processing
>> scalar instructions (even several at a time.)
>
> [ snip ]
>
>> In the VVM version of My 66000 ISA the above loop is written:
>>
>> VEC 0
>> LDB R5,[R1+R4]
>> STB R5,[R2+R4]
>> LOOP LT,R4,#1,R3
>
> I've wondered about the need for the "VEC" instruction. I think the
> hardware cam figure out which registers are scalar or vector by the end of the
> 2nd iteration.


I'll leave aside the discussion of having the hardware determine which
registers are scalar versus vector, but note that the VEC instruction also
serves to tell the hardware where the top of the loop is. The LOOP
instruction doesn't contain that information. It only knows where to
loop to based on the hardware saving the address of the instruction
after the VEC.

Also, note that Mitch is at least considering having the VEC
instruction give the address of a save area in RAM to hold large-
precision intermediate values in case the loop is interrupted.

MitchAlsup
Mar 1, 2020, 9:47:11 PM
On Sunday, March 1, 2020 at 5:05:43 PM UTC-6, Kent Dickey wrote:
> In article <0c38fe7a-8814-4e52...@googlegroups.com>,
> MitchAlsup <?????> wrote:
> >As I have discussed here before, I have invented a means to process
> >instructions quickly on not-very-wide machines called the "Virtual Vector
> >Method". Today's topic concerns some "other" stuff one may want to do when
> >processing Vector Loops that would be unavailable when one is processing
> >scalar instructions (even several at a time.)
>
> [ snip ]
>
> >In the VVM version of My 66000 ISA the above loop is written:
> >
> > VEC 0
> > LDB R5,[R1+R4]
> > STB R5,[R2+R4]
> > LOOP LT,R4,#1,R3
>
> I've wondered about the need for the "VEC" instruction. I think the
> hardware cam figure out which registers are scalar or vector by the end of the
> 2nd iteration.

The VEC instruction serves 2 purposes (one of them slightly different than
when we talked 3-5 months ago on VVM).

Right now VEC does 2 things::
a) it directly annotates the registers with "loop carried dependencies"
b) it provides the position of the top of the loop (so the LOOP instruction
does not need a branch offset)

One can argue that one can figure out LCDs in 2 passes; however, I assigned
a third unit of utility to VEC. Those registers marked in VEC are copied
out of the Loop back into scalar registers; the vector registers "in the
loop" are only used for data dependencies (and are not copied out.) Since
most data in a loop is never used again, thinning this out streamlines
the epilogue of the Loop.
>
> Imagine having a 32-bit hidden VEC register which the hardware
> manages (roughly creating the state of the VEC instruction, where in the VEC
> regsiter each bit refers to R0-R31, and '1' means that register is scalar,
> and '0' means it's a vector register). And there's a 2-bit VEC_STATUS
> register indicating state. We need another VEC_RD 32-bit register. Normal
> machine operation is VEC_STATUS=0 (and exiting a LOOP sets VEC_STATUS=0,
> VEC_RD=0, VEC=0).

VVM has no state above that of the non-vectorized ISA. What you describe
above has state that must be managed at context switch,....

>
> Detecting a scalar register means detecting a register which is first read
> in the loop, and then later written. Other combinations are not scalar.

Scalar registers "in a loop" are read and are not written. Registers which
are not carrying loop dependencies, but are written are vector registers.
>
> Upon seeing a taken LOOP instruction while VEC_STATUS=0, set VEC_STATUS=1,
> meaning learn. Now each register accessed is determined
> whether it is scalar or not by whether it's living across iterations. Set
> VEC_RD[x]=1 if register x is read from. If register x is written to, set
> VEC[x]=1 if VEC_RD[x]=1. When LOOP is seen again and taken, VEC now knows
> which registers are scalars, so VEC_STATUS=2 and HW enters the usual VVM mode,
> using VEC to inform the execution units appropriately as each instruction
> is fetched, and then the vector mode takes over.

While you could do this, it ends up a total waste if the loop is traversed
2-3-4-5 times. That only pays off when the loop count is larger.

Point 2: the compiler KNOWS this at compile time; the VEC instruction gives
a way to communicate this to HW.

Point 3: If the compiler so communicates, all one has to see is a use as
a destination to qualify the register as vector.
>
> Why slow things down (since these first 2 iterations won't be as fast as
> the usual VVM)? First, to avoid the software problem: if you ask
> software to do special notation for loops (e.g., put in the VEC instruction),
> it will generally make poor choices. Either unimportant loops will be
> vectorized, or important loops will be missed.

I am trying to vectorize loops with trip counts as low as 3 and still
be faster than if vectorization was not performed at all. Thus, even
the symbol table stuff in a C compiler can use vectorized strcpy to
fill in the symbol table from the parsed text.

> Second, by leaving it to
> hardware, you can enhance the feature in the future without needing
> software changes. As I envision it, software would ALWAYS use the LOOP
> instruction for any loop, even ones it knew would be short, and just not
> care since the hardware just does the right thing in all cases.

There is a maximum number of instructions that can be co-vectorized.

But you are correct, I want all short loops to be vectorized--unless they
contain ATOMIC instructions.
>
> And third, a future implementation could allow nested LOOPs on GBOoO
> implementations, which would be even more powerful.

I thought about this a lot, and ended up with no.

It is perfectly feasible to have additional resources execute the code after
the Loop and figure out a new Loop is on the way WHILE the current Loop is
being iterated.
But it is not feasible to nest vectorization without getting lost in the
resource complexity.

> No software changes
> would be needed, as long as software just made most/all loops use LOOP, even
> nested ones. Hardware can nest LOOP as deep as it feels it can support,
> dynamically.
>
> And fourth, if you add LOOP to a LOOP prediction table, you can get these
> cases right for future calls, even nested cases. Like a branch lookup to
> get the target, you detect the start of a loop by remembering it, and pull
> out the VEC register information, and be vectorized from the first iteration
> (as long as you've seen it before). This table could be pretty small,
> since it would just contain loops that lasted long enough to really take
> advantage of the vectorized loops.

I think just taking Looping branches out of the Branch Prediction Table will
be a big enough win to make the added gain from Loop prediction rather small. Remember we are
performing the Increment, compare, and branch in a single cycle, so the
minimum Loop time is 1 cycle, but as discussed in this thread above, one
can perform several iterations of a Loop in a single cycle.
>
> Taking this idea to the extreme, you technically don't need the LOOP
> instruction either. But detecting loops to try to vectorize is much
> harder without it, and the LOOP operation is a very useful one to have.

I am actually considering making the LOOP instruction part of the std ISA
so that the only Vectorization instruction is VEC.

LOOP performs the ADD, the COMPARE, and the BRANCH in a single cycle from
a single instruction. LOOP can also perform an ADD, a compare of a different
register to zero, and a Branch NE0 in a single cycle and a single instruction.
Several other constructs are in the works so that things like strncpy
can be performed with a single LOOP instruction.
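
My reading of those two LOOP behaviours as a scalar C model (a sketch only;
register names follow the example above, nothing about the actual encoding
is implied):

#include <stdbool.h>
#include <stdint.h>

/* Counted form, e.g. LOOP LT,R4,#1,R3: add, compare, branch in one step. */
static bool loop_lt(int64_t *r4, int64_t step, int64_t r3)
{
    *r4 += step;
    return *r4 < r3;          /* true => branch back to the top of the VEC loop */
}

/* Conditional form: add, then test a different register against zero. */
static bool loop_ne0(int64_t *r4, int64_t step, int64_t test_reg)
{
    *r4 += step;
    return test_reg != 0;     /* true => keep looping */
}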
>
> Kent

Terje Mathisen
Mar 2, 2020, 2:13:40 AM
Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> I did note that that any relatively aligned strcpy() style operation
>> should run at whatever speed the memory hierarchy can support, with
>> misaligned data the fastest safe solution is probably to use aligned
>> loads, and misaligned stores only for full SIMD blocks, specialcasing
>> the last/partial block.
>
> Why aligned loads and misaligned stores? My thoughts go for
> misaligned loads and aligned stores (special handling for first/last
> block), e.g.,
> <https://github.com/AntonErtl/move/blob/master/avxmemcpy.c>. The
> reasons are:
>
> * A misaligned load just becomes two aligned loads plus some shifting
> and merging in hardware, and many CPUs (e.g., Skylake) support more
> loads than stores per cycle.

This seems natural, but a misaligned load can straddle a protection
boundary where the terminator (C strings) is located before the
boundary, i.e. you will trap due to loading past the page end.

The perfect solution is of course how Mill does it, by having metadata
allowing the cpu to know that the final bytes are invalid but safely
delay the trap until it tries to use them.

>
> * A misaligned store may become two loads, some shifting and merging,
> and two stores in the hardware. Or are store buffers so good that
> they avoid all that, and they convert a sequence of adjacent
> misaligned stores into a sequence of aligned stores?

If you check the speed on modern cpus it is quite typical that all
misaligned stores within a cache line carry no extra cost, those that
straddle two lines take an extra cycle, while some cpus have additional
(sometimes large!) cost when straddling a page boundary.

ISTR Agner Fog having some detailed measurements here?

Anton Ertl
Mar 2, 2020, 3:56:19 AM
Terje Mathisen <terje.m...@tmsw.no> writes:
>Anton Ertl wrote:
>> Why aligned loads and misaligned stores? My thoughts go for
>> misaligned loads and aligned stores (special handling for first/last
>> block), e.g.,
>> <https://github.com/AntonErtl/move/blob/master/avxmemcpy.c>. The
>> reasons are:
>>
>> * A misaligned load just becomes two aligned loads plus some shifting
>> and merging in hardware, and many CPUs (e.g., Skylake) support more
>> loads than stores per cycle.
>
>This seems natural, but a misaligned load can straddle a protection
>boundary where the terminator (C strings) is located before the
>boundary, i.e. you will trap due to loading past the page end.

Ah, yes, this misfeature of C. One alternative would be to
special-case the page-straddling case: Do an aligned load of the last
word and check for the terminator there; and use unaligned loads for
actually getting the data except if that aligned load finds the
terminator. However, implementing that probably increases the startup
overhead, which is not so great for (usually short) strings.

>> * A misaligned store may become two loads, some shifting and merging,
>> and two stores in the hardware. Or are store buffers so good that
>> they avoid all that, and they convert a sequence of adjacent
>> misaligned stores into a sequence of aligned stores?
>
>If you check the speed on modern cpus it is quite typical that all
>misaligned stores within a cache line carries no extra cost, those that
>straddle two lines take an extra cycle, while some cpus have additional
>(sometimes large!) cost when straddling a page boundary.
>
>ISTR Agner Fog having some detailed measurements here?

I have not found them, and not much recent stuff otherwise, either. I
guess I should measure it, but I don't have the time at the moment.

EricP
Mar 2, 2020, 10:36:06 AM
Just a quick point: termination due to an exception and
termination due to a termination char should not work
the same way because an exception can roll back to a
convenient point of its choice as long as it is restartable,
whereas a term char is prescribed.

For example, when a variable length instruction straddles
a page boundary and faults fetching from the second page,
the processor rolls back the PC to the prior page
so it can restart, and passes both the PC and the different
exception address and exception details to the handler.

If, whenever possible, you are trying to write a whole cache line
at once to avoid fetching and merging with prior contents,
then it seems optimal to write aligned cache lines until
the length counter indicates a partial line remaining,
then do a fetch and merge. If a term char is encountered
in the source then we have no choice but to fetch the prior
dest line and merge the partial buffer into it.

If an exception occurs, since the dest cache line is always
aligned you don't have to worry about flushing some partial
store buffer content and merging it. Just roll back the
PC, source, dest, and length to the aligned store address,
and deliver the actual exception address as separate
(just like a straddling instruction does).



Stephen Fuld
Mar 2, 2020, 11:25:05 AM
On 3/1/2020 9:26 AM, MitchAlsup wrote:
> On Sunday, March 1, 2020 at 8:56:55 AM UTC-8, Stephen Fuld wrote:
>> On 2/29/2020 10:06 AM, MitchAlsup wrote:
>>> On Saturday, February 29, 2020 at 9:14:22 AM UTC-8, Stephen Fuld wrote:
>>>> On 2/28/2020 2:52 PM, MitchAlsup wrote:
>>>>> As I have discussed here before, I have invented a means to process
>>>>> instructions quickly on not-very-wide machines called the "Virtual Vector
>>>>> Method". Today's topic concerns some "other" stuff one may want to do when
>>>>> processing Vector Loops that would be unavailable when one is processing
>>>>> scalar instructions (even several at a time.)


snip

>> And is the hardware smart enough that if you code the test for a zero
>> byte after the store (for string copy) to convert that test to check
>> multiple bytes in one cycle?
>
> Thankfully, checking for multiple zeroes is not very difficult.


Sure, but it has not only to do the test, but store only the bytes up to
and including the zero byte, so a little more complex. :-)

Also, is it restricted to testing for zero, or can some other value, as
specified in the conditional branch instruction, be used?

Lastly, if the conditional branch specifies a half word, i.e. for 16 bit
per character strings, does that also work correctly?

MitchAlsup
Mar 2, 2020, 1:20:59 PM
On Monday, March 2, 2020 at 10:25:05 AM UTC-6, Stephen Fuld wrote:
> On 3/1/2020 9:26 AM, MitchAlsup wrote:
> > On Sunday, March 1, 2020 at 8:56:55 AM UTC-8, Stephen Fuld wrote:
> >> On 2/29/2020 10:06 AM, MitchAlsup wrote:
> >>> On Saturday, February 29, 2020 at 9:14:22 AM UTC-8, Stephen Fuld wrote:
> >>>> On 2/28/2020 2:52 PM, MitchAlsup wrote:
> >>>>> As I have discussed here before, I have invented a means to process
> >>>>> instructions quickly on not-very-wide machines called the "Virtual Vector
> >>>>> Method". Today's topic concerns some "other" stuff one may want to do when
> >>>>> processing Vector Loops that would be unavailable when one is processing
> >>>>> scalar instructions (even several at a time.)
>
>
> snip
>
> >> And is the hardware smart enough that if you code the test for a zero
> >> byte after the store (for string copy) to convert that test to check
> >> multiple bytes in one cycle?
> >
> > Thankfully, checking for multiple zeroes is not very difficult.
>
>
> Sure, but it has not only to do the test, but store only the bytes up to
> and including the zero byte, so a little more complex. :-)

Yes, byte-write-mask.
>
> Also, is is restricted to testing for zero, or can some other value, as
> specified in the conditional branch instruction, be used?

Sure, as long as it is a single value; an XOR gate performs the
comparison and a wide OR gate performs the EQUAL.
>
> Lastly, if the conditional branch specifies a half word, i.e. for 16 bit
> per character strings, does that also work correctly?

The branch does not specify width; the memory reference leading to the
branch is the only point where width is known. Then it is on the Load-
Alignment HW to produce the proper bit pattern into the register. But
once <properly> in the register the branch can consume any of the standard
widths {8,16,32,64}.
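
The XOR-plus-wide-OR just described, written as a software sketch over one
64-bit chunk (the classic SWAR trick; widths and names are purely
illustrative):

#include <stdbool.h>
#include <stdint.h>

/* True if any byte lane of 'word' equals 'target'. */
static bool any_byte_matches(uint64_t word, uint8_t target)
{
    uint64_t t = (uint64_t)target * 0x0101010101010101ull;  /* replicate target  */
    uint64_t x = word ^ t;                                  /* matching lane -> 0 */
    return ((x - 0x0101010101010101ull) & ~x & 0x8080808080808080ull) != 0;
}

The position of the first (lowest) matching lane, for the byte-write-mask,
then falls out of a trailing-zero count on that same value.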

Paul A. Clayton
Mar 2, 2020, 5:06:21 PM
On Friday, February 28, 2020 at 5:52:34 PM UTC-5, MitchAlsup wrote:
> As I have discussed here before, I have invented a means to process
> instructions quickly on not-very-wide machines called the "Virtual Vector
> Method".

Somewhat related: I wonder if there would be a way to architect
something like "cache oblivious" methods for blocking of vector-
friendly code — and if such would be worthwhile.

For the classic example of matrix-matrix multiplication, hardware
to support in-cache retention of transposed vectors might be
useful. Such might even be generalized to include array-of-structures
to structure-of-arrays transformations (e.g., when typical accesses
favor the array-of-structures but a localized use would benefit from
caching the "lossily compressed" form).

One of the benefits of vector registers is that such rearranged
data can be cached in the register for reuse.

In some cases, dynamic behavior choices could do better even than
microarchitecture specific software optimization. If cache capacity
is shared, a smaller blocking factor might be better than what
software would choose based on no-sharing (or larger than software
might choose assuming pessimistic sharing). With thermal throttling
(which can be caused by program-external factors), the price of
compute might change relative to the price of memory bandwidth (or
cache capacity).

(Choosing what information to communicate to hardware seems a
challenging problem. Under simple deterministic systems, "do this
like this" makes more sense, but even with complex systems I
suspect some metadata might be useful to the lower layer. The
variability in cost of storing and communicating such metadata
(as well as generating it, whether locally to the core, to the
computer system, or to the software distribution system [users giving
behavior feedback which is then distributed as software patches/
updates]) also complicates the choice.)

MitchAlsup
Mar 2, 2020, 6:16:49 PM
On Monday, March 2, 2020 at 4:06:21 PM UTC-6, Paul A. Clayton wrote:
> On Friday, February 28, 2020 at 5:52:34 PM UTC-5, MitchAlsup wrote:
> > As I have discussed here before, I have invented a means to process
> > instructions quickly on not-very-wide machines called the "Virtual Vector
> > Method".
>
> Somewhat related: I wonder if there would be a way to architect
> something like "cache oblivious" methods for blocking of vector-
> friendly code — and if such would be worthwhile.

What we have here is cache optimization not obliviation.

We access memory hierarchy and then consume the entire cache line but
do not stick the line in the cache (inbound) or fill up the cache line
and write-invalidate it out rather than waiting for permission.
>
> For the classic example of matrix-matrix multiplication, hardware
> to support in-cache retention of transposed vectors might be
> useful. Such might even be generalized to include array-of-structures
> to structure-of-arrays transformations (e.g., when typical accesses
> favor the array-of-structures but a localized use would benefit from
> caching the "lossily compressed" form).
>
> One of the benefits of vector registers is that such rearranged
> data can be cached in the register for reuse.

One side problem, here:: VVM does not add ANY registers to ISA in its
support of vector Looping.
>
> In some cases, dynamic behavior choices could do better even than
> microarchitecture specific software optimization. If cache capacity
> is shared, a smaller blocking factor might be better than what
> software would choose based on no-sharing (or larger than software
> might choose assuming pessimistic sharing). With thermal throttling
> (which can be caused by program-external factors), the price of
> compute might change relative to the price of memory bandwidth (or
> cache capacity).

Thermal throttling is one of the reasons VVM exists. While one is
executing a VVM loop the FETCH/PARSE/DECODE stages of the pipeline
are quiescent, the RF is doing no reads or writes, but the data-
path is moving results to operands and performing calculations
about as wide and fast as possible. The quiescence of the front end
kills off about 50% of normal power consumption; the higher-width
calculations probably double data-path power. So power ends up at
about break-even (100%) while running 3-4× faster.

This thread started with the notion that the data cache can decide how
many loop iterations the HW has performed that clock and feed that into
the LOOP instruction so the index register gets incremented properly.
Instead of SW driving HW, HW is telling SW how many things it just got
done.

lkcl
Mar 3, 2020, 7:58:23 AM
On Friday, February 28, 2020 at 10:52:34 PM UTC, MitchAlsup wrote:

>
> I am not exactly sure how to figure out that short string copies are
> detected so they end up in the data cache while long string copies
> avoid damaging the data cache. One thought train is that if the string
> ends up smaller than 1 (or 2) cache lines in length, it goes into
> the data cache and if it is longer, it goes into the memory hierarchy.

so let me get this straight: you're effectively talking about *bypassing* the normal L1 cache rules, for data-dependent vector loops, is that correct?

(also by identifying these loops cleanly and in easily predictable fashion, not needing them to be branch-predicted, like other DSP-style zero-overhead loop instructions)

how do things work when memory accesses occur *after* the vecloop copy, both to the data being copied and the target?

if the L1 cache was (effectively) bypassed how is data consistency maintained when scalar accesses are subsequently made to the src and dest?

l.

lkcl
Mar 3, 2020, 8:16:44 AM
On Monday, March 2, 2020 at 6:20:59 PM UTC, MitchAlsup wrote:
> On Monday, March 2, 2020 at 10:25:05 AM UTC-6, Stephen Fuld wrote:

> > Sure, but it has not only to do the test, but store only the bytes up to
> > and including the zero byte, so a little more complex. :-)
>
> Yes, byte-write-mask.
> >
> > Also, is is restricted to testing for zero, or can some other value, as
> > specified in the conditional branch instruction, be used?
>
> Sure, as long as it is a single value; and XOR gate performs the
> comparison and a wide OR gate performs the EQUAL.

a small digression. ARM SVE added "data dependent fail on first" LD/ST. it works by only raising a memory exception if the *very first* LD (or ST) element in the vector is on an illegal address.

for all other elements, the possibility of an exception does *not* raise that exception, it *truncates the vector length* right there, right then.

thus, clearly, subsequent vector operations will only refer to the *valid* elements (the ones that were LDed) and, in the case of a memcpy, because the LD was truncated, the parallel vector ST is also truncated to the correct amount as well.

it is quite clever but not very sophisticated. adding on the extra zero end terminator for strcpy gets very messy in RVV: VL has to be read out inside the loop, then a popcount carried out, and many other hoops jumped through.

it occurred to me in SV that an elegant addition to vector architectures would be to add data-dependent "truncate VL on element result is zero". more complex matches may be done by allowing those complex operations (XORs) to also have this same feature.

thus it becomes unnecessary to store a special comparator mask (with the temptation to then allow sophisticated arithmetic matching equivalent to a full CPU's ALU), as an instruction that produces a zero condition (by way of an OR, or an AND, or even a subtract or shift operation) becomes the "vector length truncator".
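
a scalar model of the proposed semantics, as I understand them (hypothetical; this is not RVV, SVE, or SV as specified, and whether the zero-producing lane itself stays active is a design choice):

#include <stddef.h>
#include <stdint.h>

/* Given the per-lane results of some vector op, return the truncated VL:
   all lanes up to and including the first lane whose result is zero.   */
static size_t truncate_vl_on_zero(const uint8_t *lane_result, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if (lane_result[i] == 0)
            return i + 1;      /* keep the zero lane (e.g. the strcpy terminator) */
    return vl;                 /* no zero seen: VL unchanged */
}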

just a thought.

l.

Terje Mathisen
Mar 3, 2020, 8:52:55 AM
lkcl wrote:
> On Friday, February 28, 2020 at 10:52:34 PM UTC, MitchAlsup wrote:
>
>>
>> I am not exactly sure how to figure out that short string copies
>> are detected so they end up in the data cache while long string
>> copies avoid damaging the data cache. One thought train is that if
>> the string ends up smaller than 1 (or 2) cache lines in length, it
>> goes into the data cache and if it is longer, it goes into the
>> memory hierarchy.
>
> so let me get this straight: you're effectively talking about
> *bypassing* the normal L1 cache rules, for data-dependent vector
> loops, is that correct?

This seems very similar to the *NT load/store ops in x86, which will use
a cache line sized streaming buffer but not write this back to the
normal caches.
>
> (also by identifying these loops cleanly and in easily predictable
> fashion, not needing them to be branch-predicted, like other
> DSP-style zero-overhead loop instructions)
>
> how do things work when memory accesses occur *after* the vecloop
> copy, both to the data being copied and the target?
>
> if the L1 cache was (effectively) bypassed how is data consistency
> maintained when scalar accesses are subsequently made to the src and
> dest?

The streaming buffer effectively acts like a separate cache, with the
normal cache protocols to sync the state. I.e. any subsequent scalar
load/store will work normally, bringing the relevant line into $L1 just
as if that line was empty, owned by another core or (for vector loads)
if it was also already cache-resident.

The final possibility provides for a possible optimization where the
streaming load will fill its buffer from the normal cache instead of
having to go to ram.
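
For comparison, the explicit opt-in that x86 software uses today for the
same effect, sketched with AVX non-temporal stores (dst assumed 32-byte
aligned, n a multiple of 32); the point of VVM is that none of this ever
appears in the source:

#include <immintrin.h>
#include <stddef.h>

void copy_stream(void *dst, const void *src, size_t n)
{
    __m256i *d = (__m256i *)dst;
    const __m256i *s = (const __m256i *)src;
    for (size_t i = 0; i < n / 32; i++) {
        __m256i v = _mm256_loadu_si256(s + i);   /* ordinary load             */
        _mm256_stream_si256(d + i, v);           /* store bypasses the caches */
    }
    _mm_sfence();                                /* order the NT stores       */
}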

Bruce Hoult
Mar 3, 2020, 10:39:34 AM
On Tuesday, March 3, 2020 at 5:16:44 AM UTC-8, lkcl wrote:
> On Monday, March 2, 2020 at 6:20:59 PM UTC, MitchAlsup wrote:
> > On Monday, March 2, 2020 at 10:25:05 AM UTC-6, Stephen Fuld wrote:
>
> > > Sure, but it has not only to do the test, but store only the bytes up to
> > > and including the zero byte, so a little more complex. :-)
> >
> > Yes, byte-write-mask.
> > >
> > > Also, is is restricted to testing for zero, or can some other value, as
> > > specified in the conditional branch instruction, be used?
> >
> > Sure, as long as it is a single value; and XOR gate performs the
> > comparison and a wide OR gate performs the EQUAL.
>
> a small digression. ARM SVE added "data dependent fail on first" LD/ST. it works by only raising a memory exception if the *very first* LD (or ST) element in the vector is on an illegal address.
>
> for all other elements, the possibility of an exception does *not* raise that exception, it *truncates the vector length* righ there, right then.
>
> thus, clearly, subsequent vector operations will only refer to the *valid* elememts (the ones that were LDed) and, in the case of a memcpy, because the LD was truncated, the parallel vector ST is also truncated to the correct amount as well.
>
> it is quite clever but not very sophisticated. adding on the extra zero end terminator for strcpy gets very messy in RVV: VL has to be read out inside the loop, then a popcount carried out, and many other hoops jumped through.

If no terminating zero byte is found in the truncated vector before an inaccessible memory region then the strcpy() will quite rightly segfault on the next loop iteration, when it tries to read starting at the first byte of said inaccessible region.

strncpy() is fine too -- or at least follows the spec -- as no terminating null is to be added if one was not found before reading n bytes.

MitchAlsup
Mar 3, 2020, 12:06:16 PM
On Tuesday, March 3, 2020 at 6:58:23 AM UTC-6, lkcl wrote:
> On Friday, February 28, 2020 at 10:52:34 PM UTC, MitchAlsup wrote:
>
> >
> > I am not exactly sure how to figure out that short string copies are
> > detected so they end up in the data cache while long string copies
> > avoid damaging the data cache. One thought train is that if the string
> > ends up smaller than 1 (or 2) cache lines in length, it goes into
> > the data cache and if it is longer, it goes into the memory hierarchy.
>
> so let me get this straight: you're effectively talking about *bypassing* the normal L1 cache rules, for data-dependent vector loops, is that correct?

Yes, I want inbound data that does not hit in the L1 to stream in from
wherever, and I would like outbound data that did not hit in the
cache to stream out to wherever.
>
> (also by identifying these loops cleanly and in easily predictable fashion, not needing them to be branch-predicted, like other DSP-style zero-overhead loop instructions)

This is what the VEC and LOOP instructions do.
>
> how do things work when memory accesses occur *after* the vecloop copy, both to the data being copied and the target?

Those take a miss on the cache and go looking for the data through the
memory hierarchy.
>
> if the L1 cache was (effectively) bypassed how is data consistency maintained when scalar accesses are subsequently made to the src and dest?

No data is "lost".
Loads are read up from the memory hierarchy; if the data arrive in a state
not already "exclusive", then the data can be dropped, as the cache line
supplying the data exists elsewhere with the same bit patterns.
The Loop count is checked to see if the Stores will completely fill a
cache line (or more) and if so, then the cache line is filled and when
filled it is broadcast to the system as Write-Invalidate. There are certain
instances where the original line has to be read up and store data merged.
>
> l.

Ivan Godard
Mar 3, 2020, 12:34:55 PM
Which means your data better be organized as struct-of-arrays instead of
array-of-structs.

Paul A. Clayton
Mar 3, 2020, 12:47:45 PM
On Monday, March 2, 2020 at 6:16:49 PM UTC-5, MitchAlsup wrote:
> On Monday, March 2, 2020 at 4:06:21 PM UTC-6, Paul A. Clayton wrote:
[snip]
>> Somewhat related: I wonder if there would be a way to architect
>> something like "cache oblivious" methods for blocking of vector-
>> friendly code — and if such would be worthwhile.
>
> What we have here is cache optimization not obliviation.

I did not invent the terminology. I do not like it either, but
it is the established nomenclature. Such techniques explicitly
target more generality than microarchitecture-specific
optimizations (for which "cache-aware" is typically used, i.e.,
not "aware that caching is used" but "aware of cache sizes and
other features") but make assumptions about size hierarchy and
often about minimum sizes of different levels. (Such is more
'fuzzy memory' than complete forgetting about cache
characteristics.)

[snip]
>> For the classic example of matrix-matrix multiplication, hardware
>> to support in-cache retention of transposed vectors might be
>> useful. Such might even be generalized to include array-of-structures
>> to structure-of-arrays transformations (e.g., when typical accesses
>> favor the array-of-structures but a localized use would benefit from
>> caching the "lossily compressed" form).
>>
>> One of the benefits of vector registers is that such rearranged
>> data can be cached in the register for reuse.
>
> One side problem, here:: VVM does not add ANY registers to ISA in its
> support of vector Looping.

The point (such as it was) is that cache can act as a
microarchitectural vector register set, supporting data
reuse (and reuse of the placement transformation).

I do not propose any mechanism for providing such; I merely
suggest it could be a useful research direction. (This seems
to be two problems: first, caching a subset for reuse, second,
caching a specifically positionally transformed subset for
reuse. The first seems plausible; the second seems challenging.)

(Similar optimizations might apply to sparse vectors.)

The overhead of indirect or strided accesses could be reduced
by reusing transformations in a manner similar to reusing
a vector register. Hardware with implicit buffers seems likely
to be more efficient than general software for such
transformations (e.g., an 8x8 double precision matrix
transpose). Explicitly making a transposed copy in software
seems to add noticeable overhead.
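
As a concrete instance of the software-side overhead being mentioned, a
plain explicit 8x8 transposed copy into a scratch buffer (nothing clever;
this is the extra pass a hardware-managed buffer might hide):

/* Explicit transposed copy of one 8x8 block of doubles. */
static void transpose8x8(double dst[8][8], const double src[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            dst[j][i] = src[i][j];
}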

Again, I am just pointing to a potential direction for
research/thought.

Anton Ertl
Mar 3, 2020, 2:07:45 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Terje Mathisen <terje.m...@tmsw.no> writes:
>>If you check the speed on modern cpus it is quite typical that all
>>misaligned stores within a cache line carries no extra cost, those that
>>straddle two lines take an extra cycle, while some cpus have additional
>>(sometimes large!) cost when straddling a page boundary.
>>
>>ISTR Agner Fog having some detailed measurements here?
>
>I have not found them, and not much recent stuff otherwise, either. I
>guess I should measure it, but I don't have the time at the moment.

I spent the time anyway. You can find the results at:
http://www.complang.tuwien.ac.at/anton/unaligned-stores/:

Concerning your claims:

1) On Intel CPUs misaligned stores within a cache line do indeed carry
no extra cost, but on AMD CPUs, they do.

2) Unaligned cross-cache-line stores cost 2 cycles extra on most CPUs
(compared to aligned stores); exceptions are Skylake and Zen2, with
one cycle penalty.

3) Page-crossing stores have a large cost on all CPUs, from 23-24
cycles extra on Haswell, Skylake, Zen and Zen2 to about 200 cycles on
Sandy Bridge and Ivy Bridge.

By contrast, the penalties for unaligned loads are much lower on K8 (3
cycles for page crossing), Sandy Bridge and Ivy Bridge (24-28 cycles)
<2015Mar3...@mips.complang.tuwien.ac.at>. I should repeat these
measurements on more modern hardware. Well, not today.
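
(Not the benchmark behind the page above -- just a rough sketch of how one
might reproduce the effect, assuming GCC/Clang on a POSIX system; wall-clock
timing is crude compared to performance counters:)

#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    enum { N = 100000000 };
    _Alignas(4096) static char buf[2 * 4096];
    uint64_t x = 0x0123456789abcdefULL;
    size_t offset = 61;              /* 61 straddles a 64-byte line,
                                        4090 straddles a page boundary */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) {
        memcpy(buf + offset, &x, sizeof x);            /* unaligned store */
        __asm__ volatile("" :: "r"(buf) : "memory");   /* keep the store */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per store at offset %zu\n", ns / N, offset);
    return 0;
}
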

MitchAlsup

unread,
Mar 3, 2020, 4:33:01 PM3/3/20
to
Which is detected as a "dense" set of memory references.

Stephen Fuld

unread,
Mar 3, 2020, 4:34:56 PM3/3/20
to
On 2/28/2020 2:52 PM, MitchAlsup wrote:

snip


> I am not exactly sure how to figure out that short string copies are
> detected so they end up in the data cache while long string copies
> avoid damaging the data cache. One thought train is that if the string
> ends up smaller than 1 (or 2) cache lines in length, it goes into
> the data cache and if it is longer, it goes into the memory hierarchy.
> This only requires another store buffer entry.


Certainly not the same, but perhaps helpful.

In 1980, I wrote the microcode for the first caching disk controller.
We had a sort of similar situation. What I decided to do was to cache
the first and last track of the transfer, and not "pollute" the cache
with any intermediate tracks.

The logic was that there was a higher than average likelihood that the
first track would be reused, especially if it was read, and could soon
be the target for a subsequent write update. The last track was cached
since there was a higher than typical likelihood that this was part of a
sequential series of transfers, thus the rest of that track had a higher
than normal chance of being referenced.

On the other hand, "middle" tracks were much less likely to be
referenced, and if they were the subject of a subsequent write, at least
we saved some time by starting the transfer to the first track without
the need for a seek, rotational latency etc.

Also, this automatically "correctly" handled the case of short transfers
that happened to cross a track boundary, i.e., both tracks would be cached.



Hope this helps.

MitchAlsup

unread,
Mar 3, 2020, 4:39:46 PM3/3/20
to
The question is how one distinguishes the kinds of vectors that
strip-mine the cache and make it essentially worthless from code
which looks exactly the same but whose data is 1/4 of the cache size or less.

There is a very good chance that all of this vector stuff ends up in
L2 or L3 rather than DRAM in any event.

>
> I do not propose any mechanism for providing such; I merely
> suggest it could be a useful research direction. (This seems
> to be two problems: first, caching a subset for reuse, second,
> caching a specifically positionally transformed subset for
> reuse. The first seems plausible; the second seems challenging.)
>
> (Similar optimizations might apply to sparse vectors.)
>
> The overhead of indirect or strided accesses could be reduced
> by reusing transformations in a manner similar to reusing
> a vector register. Hardware with implicit buffers seems likely
> to be more efficient than general software for such
> transformations (e.g., an 8x8 double precision matrix
> transpose). Explicitly making a transposed copy in software
> seems to add noticeable overhead.

I might note that when high performance number crunching is coded,
there is a specific block dealing with each transpose ordering--
classic DGEMM has 4 different code blocks, each written differently
so that all 4 transpose cases operate in as close to optimal
Row-Major order as possible.

Having seen this code, I don't ever expect a compiler to see a matrix
multiplication and spit out optimal code for any chosen transpose.
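
To make the "four code blocks" point concrete, a toy sketch (not library
code) of just the no-transpose case, for row-major arrays:

/* C[M][N] += A[M][K] * B[K][N], no transpose (the "NN" case).
   The i-k-j ordering keeps the inner loop walking B and C along rows.
   The NT, TN and TT cases each need a differently arranged loop nest
   and index order, hence four separate hand-written blocks. */
static void dgemm_nn(int M, int N, int K,
                     const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++) {
            double a = A[i*K + k];
            for (int j = 0; j < N; j++)
                C[i*N + j] += a * B[k*N + j];
        }
}
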
>
> Again, I am just pointing to a potential direction for
> research/thought.

I am aware of the things I am currently leaving out, but sometimes I forget
exactly why feature X did not make the cut.

MitchAlsup

unread,
Mar 3, 2020, 4:41:40 PM3/3/20
to
As another example where some clever heuristic is better than brute force.

lkcl

unread,
Mar 4, 2020, 2:47:31 AM3/4/20
to
On Tuesday, March 3, 2020 at 5:06:16 PM UTC, MitchAlsup wrote:
> On Tuesday, March 3, 2020 at 6:58:23 AM UTC-6, lkcl wrote:
> > On Friday, February 28, 2020 at 10:52:34 PM UTC, MitchAlsup wrote:
> >
> > >
> > > I am not exactly sure how to figure out that short string copies are
> > > detected so they end up in the data cache while long string copies
> > > avoid damaging the data cache. One thought train is that if the string
> > > ends up smaller than 1 (or 2) cache lines in length, it goes into
> > > the data cache and if it is longer, it goes into the memory hierarchy.
> >
> > so let me get this straight: you're effectively talking about *bypassing* the normal L1 cache rules, for data-dependent vector loops, is that correct?
>
> Yes, I want inbound data that does not hit in the L1 to stream in from
> wherever, and I would like outbound data that did not hit in the
> cache to stream out to wherever.
> >
> > (also by identifying these loops cleanly and in easily predictable fashion, not needing them to be branch-predicted, like other DSP-style zero-overhead loop instructions)
>
> This is what the VEC and LOOP instructions do.
> >
> > how do things work when memory accesses occur *after* the vecloop copy, both to the data being copied and the target?
>
> Those take a miss on the cache and go looking for the data through the
> memory hierarchy.

even the reads on the data that was *read* by the vector loop?

if the string (or whatever) was, say, only 64 bytes long (4 cache lines) or even say 128 or 256, the risk is that the scheme would increase latency by accident on systems that were expecting to refer to that (short-ish) data again.

this would not be the case in say 3D, where data was typically "read once, process, write once" in parallel, across millions of individual blocks.

however for general purpose code i think someone mentioned (probably you mitch) that you have 30% re-referral to previously accessed memory.

> The Loop count is checked to see if the Stores will completely fill a
> cache line (or more) and if so, then the cache line is filled and when
> filled it is broadcast to the system as Write-Invalidate.

ok so that would make sure that the written data was, if accessed by scalar code, properly consistent.

again however i wonder if general purpose code (not the normal kinds of vector workloads) would be less performant as a result.

l.




Terje Mathisen

unread,
Mar 4, 2020, 5:33:26 AM3/4/20
to
Anton Ertl wrote:
> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> Terje Mathisen <terje.m...@tmsw.no> writes:
>>> If you check the speed on modern cpus it is quite typical that all
>>> misaligned stores within a cache line carries no extra cost, those that
>>> straddle two lines take an extra cycle, while some cpus have additional
>>> (sometimes large!) cost when straddling a page boundary.
>>>
>>> ISTR Agner Fog having some detailed measurements here?
>>
>> I have not found them, and not much recent stuff otherwise, either. I
>> guess I should measure it, but I don't have the time at the moment.
>
> I spent the time anyway. You can find the results at:
> http://www.complang.tuwien.ac.at/anton/unaligned-stores/:

Great, I'll make a note of that url. :-)
>
> Concerning your claims:
>
> 1) On Intel CPUs misaligned stores within a cache line do indeed carry
> no extra cost, but on AMD CPUs, they do.
>
> 2) Unaligned cross-cache-line stores cost 2 cycles extra on most CPUs
> (compared to aligned stores); exceptions are Skylake and Zen2, with
> one cycle penalty.
>
> 3) Page-crossing stores have a large cost on all CPUs, from 23-24
> cycles extra on Haswell, Skylake, Zen and Zen2 to about 200 cycles on
> Sandy Bridge and Ivy Bridge.

So this was pretty much in line with my recollection, thanks!

Terje

>
> By contrast, the penalties for unaligned loads are much lower on K8 (3
> cycles for page crossing), Sandy Bridge and Ivy Bridge (24-28 cycles)
> <2015Mar3...@mips.complang.tuwien.ac.at>. I should repeat these
> measurements on more modern hardware. Well, not today.
>
> - anton
>


--

MitchAlsup

unread,
Mar 4, 2020, 12:04:38 PM3/4/20
to
On Wednesday, March 4, 2020 at 1:47:31 AM UTC-6, lkcl wrote:
> On Tuesday, March 3, 2020 at 5:06:16 PM UTC, MitchAlsup wrote:
> > On Tuesday, March 3, 2020 at 6:58:23 AM UTC-6, lkcl wrote:
> > > On Friday, February 28, 2020 at 10:52:34 PM UTC, MitchAlsup wrote:
> > >
> > > >
> > > > I am not exactly sure how to figure out that short string copies are
> > > > detected so they end up in the data cache while long string copies
> > > > avoid damaging the data cache. One thought train is that if the string
> > > > ends up smaller than 1 (or 2) cache lines in length, it goes into
> > > > the data cache and if it is longer, it goes into the memory hierarchy.
> > >
> > > so let me get this straight: you're effectively talking about *bypassing* the normal L1 cache rules, for data-dependent vector loops, is that correct?
> >
> > Yes, I want inbound data that does not hit in the L1 to stream in from
> > wherever, and I would like outbound data that did not hit in the
> > cache to stream out to wherever.
> > >
> > > (also by identifying these loops cleanly and in easily predictable fashion, not needing them to be branch-predicted, like other DSP-style zero-overhead loop instructions)
> >
> > This is what the VEC and LOOP instructions do.
> > >
> > > how do things work when memory accesses occur *after* the vecloop copy, both to the data being copied and the target?
> >
> > Those take a miss on the cache and go looking for the data through the
> > memory hierarchy.
>
> even the reads on the data that was *read* by the vector loop?
>
> if the string (or whatever) was, say, only 64 bytes long (4 cache lines) or even say 128 or 256, the risk is that the scheme would increase latency by accident on systems that were expecting to refer to that (short-ish) data again.

One could put the first cache line (64 bytes) of data "in" the cache,
which would handle the vast majority of symbol table uses.
>
> this would not be the case in say 3D where data was typically "read once, process, write once" in parallel, millions of individual blocks
>
> however general purpose code i think someone mentioned (probably you mitch) you have 30% re-referral to previously accessed memory.

Very generally, calculated or read-up data is used 1.1 times.
Very generally, calculated or read-up data in a loop is used 1.02 times.
>
> > The Loop count is checked to see if the Stores will completely fill a
> > cache line (or more) and if so, then the cache line is filled and when
> > filled it is broadcast to the system as Write-Invalidate.
>
> ok so that would make sure that the written data was, if accessed by scalar code, properly consistent.

Otherwise it's no good.
>
> again however i wonder if general purpose code (not the normal kinds of vector workloads) would be less performant as a result.

With a good chance of the data being found in L2 or L3, it probably does not
matter a lot; remember, we already saved having to "get write permission",
so we are ahead of the game locally.
>
> l.

Kent Dickey

unread,
Mar 7, 2020, 10:13:29 PM3/7/20
to
In article <r3lnfj$bi3$1...@gioia.aioe.org>,
Terje Mathisen <terje.m...@tmsw.no> wrote:
>lkcl wrote:
>> On Friday, February 28, 2020 at 10:52:34 PM UTC, MitchAlsup wrote:
>>
>>>
>>> I am not exactly sure how to figure out that short string copies
>>> are detected so they end up in the data cache while long string
>>> copies avoid damaging the data cache. One thought train is that if
>>> the string ends up smaller than 1 (or 2) cache lines in length, it
>>> goes into the data cache and if it is longer, it goes into the
>>> memory hierarchy.
>>
>> so let me get this straight: you're effectively talking about
>> *bypassing* the normal L1 cache rules, for data-dependent vector
>> loops, is that correct?
>
>This seems very similar to the *NT load/store ops in x86, which will use
>a cache line sized streaming buffer but not write this back to the
>normal caches.

Let me point out something ARM CPUs do: store streaming. If the CPU
detects 2-3 cachelines of stores, where every byte on the cachelines is
written, then it switches to streaming mode for the next address. In
streaming mode, the CPU no longer writes the data to L1 cache, and instead
issues bulk writes directly to memory (or, to an outer cache level).

This is likely done completely in the store buffer--once a few cacheline
requests have been seen and the CPU still doesn't have ownership, it
can see that more stores are coming in. And so it can switch to no longer
writing to cache. And if the stores stop, and don't finish out a line,
the data is still there in the store buffer, so it just gives up waiting for
more stores, and just engages the usual store buffer draining policy.
ARM coherency protocol allows a full cacheline write-and-invalidate which
other CPUs treat as an Invalidation request, and they toss any data in their
cache for that address. Resolving ordering in case of conflict is just part
of normal coherency.

This mechanism is independent of any actual loops, and so can be applied
to any CPU. It's a good idea for accelerating large bulk copies since
it dynamically detects them.

Detecting load streams is harder. ARM handles this with hints on the loads.
Loads can be marked as non-temporal, where software says it won't be used
again. I think some CPUs just always use cache way 0 for these loads,
which is an easy way to prevent polluting the cache while still using the
same cache.

But you can ignore the vectors and do a load stream detection. Let's
assume you have a small side cache of like 2KB or so, in parallel to your
real L1 data cache, and it's basically fully associative. Let's assume
you use this for Spectre mitigation as well. All loads go to this side
cache (we could call it a Prefetch-Miss cache, like the PA7200 I worked on
25 years ago). And in roughly FIFO order, let the oldest lines from this
cache get moves to the L1 data cache, once speculation is resolved (this
is the Spectre mitigation part).

Now, you just need load-stream detection logic--if code is striding through,
touching most data on the line, then mark lines to NOT get moved to the
L1 cache once the stream has continued for 4-5 cache lines.
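
A toy model of that detection logic in C (thresholds and names are
illustrative, and it only tracks sequential line-to-line advance rather
than "most bytes touched"):

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES   64
#define STREAM_LINES 5    /* consecutive lines before we call it a stream */

struct stream_detector {
    uint64_t last_line;   /* line index of the previous load  */
    unsigned run_length;  /* consecutive-line run seen so far */
};

/* Returns true when the current load belongs to a detected stream and
   its line should therefore not be promoted into the L1 data cache. */
static bool bypass_l1(struct stream_detector *d, uint64_t load_addr)
{
    uint64_t line = load_addr / LINE_BYTES;

    if (line == d->last_line + 1)
        d->run_length++;               /* dense, sequential advance */
    else if (line != d->last_line)
        d->run_length = 0;             /* stream broken */
    d->last_line = line;

    return d->run_length >= STREAM_LINES;
}
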

Again, this can be independent of VEC/LOOP, and it applies to all code. It
still pollutes the outer caches, though. That can be addressed by once the
stream is detected, use the ARM "replace to way 0" trick for the outer caches
as well.

Kent

MitchAlsup

unread,
Mar 8, 2020, 6:21:45 PM3/8/20
to
On Saturday, March 7, 2020 at 9:13:29 PM UTC-6, Kent Dickey wrote:
> In article <r3lnfj$bi3$1...@gioia.aioe.org>,
> Terje Mathisen <terje.m...@tmsw.no> wrote:
> >lkcl wrote:
> >> On Friday, February 28, 2020 at 10:52:34 PM UTC, MitchAlsup wrote:
> >>
> >>>
> >>> I am not exactly sure how to figure out that short string copies
> >>> are detected so they end up in the data cache while long string
> >>> copies avoid damaging the data cache. One thought train is that if
> >>> the string ends up smaller than 1 (or 2) cache lines in length, it
> >>> goes into the data cache and if it is longer, it goes into the
> >>> memory hierarchy.
> >>
> >> so let me get this straight: you're effectively talking about
> >> *bypassing* the normal L1 cache rules, for data-dependent vector
> >> loops, is that correct?
> >
> >This seems very similar to the *NT load/store ops in x86, which will use
> >a cache line sized streaming buffer but not write this back to the
> >normal caches.
>
> Let me point out something ARM CPU's do: store streaming. If the CPU
> detects 2-3 cachelines of stores, where every byte on the cachelines is
> written, then it switches to streaming mode for the next address. In
> streaming mode, the CPU no longer writes the data to L1 cache, and instead
> issues bulk writes directly to memory (or, to an outer cache level).

Good to know
>
> This is likely done completely in the store buffer--once a few cacheline
> requests have been seen and the CPU still doesn't have ownership, it
> can see that more stores are coming in. And so it can switch to no longer
> writing to cache. And if the stores stop, and don't finish out a line,
> the data is still there in the store buffer, so it just gives up waiting for
> more stores, and just engages the usual store buffer draining policy.

VEC/LOOP enhance the ability to recognize the situation by making the
code compact, with known properties (such as that it IS a loop).

> ARM coherency protocol allows a full cacheline write-and-invalidate which
> other CPUs treat as an Invalidation request, and they toss any data in their
> cache for that address. Resolving ordering in case of conflict is just part
> of normal coherency.
>
> This mechanism is independent of any actual loops, and so can be applied
> to any CPU. It's a good idea for accelerating large bulk copies since
> it dynamically detects them.

I would like to know how often a cache line is filled up (or completely
consumed) without a loop being present. My hunch is "not very often".
>
> Detecting load streams is harder. ARM handles this with hints on the loads.
> Loads can be marked as non-temporal, where software says it won't be used
> again. I think some CPUs just always use cache way 0 for these loads,
> which is an easy way to prevent polluting the cache while still using the
> same cache.

I think VEC/LOOP can be used as said hint.
>
> But you can ignore the vectors and do a load stream detection. Let's
> assume you have a small side cache of like 2KB or so, in parallel to your

I would assume this side cache (or L0) is 8 cache lines (512 Bytes) for
lower end machines and maybe 2× for medium machines and 4× for Big ones.

> real L1 data cache, and it's basically fully associative. Let's assume
> you use this for Spectre mitigation as well. All loads go to this side
> cache (we could call it a Prefetch-Miss cache, like the PA7200 I worked on
> 25 years ago). And in roughly FIFO order, let the oldest lines from this
> cache get moves to the L1 data cache, once speculation is resolved (this
> is the Spectre mitigation part).
>
> Now, you just need a load-stream detection logic--if code is striding through
> touching most data on the line, then mark that line to NOT get moved to the
> L1 cache if the stream continues for 4-5 cache lines.
>
> Again, this can be independent of VEC/LOOP, and it applies to all code.

The question remains: is there more to be gained within a Loop, or from without?

Ivan Godard

unread,
Mar 8, 2020, 7:29:43 PM3/8/20
to
Actually quite common, during initialization of larger class or
structure objects.

A lot of the discussion on this board is oriented toward math-like
operations on numeric arrays, to the neglect of the non-math computation
of data processing. IMO, ISAs should pay more attention to the needs of
apps whose major expensive data structure is an array of payroll
records. Not everything is linear algebra!

The schemes suggested in this thread tend to fail for class/struct
compute. One problem from such needs is that the initialization is
explicit, not looping. Another is that the address order is dictated by
the language and is essentially random vis-a-vis the hardware, rather
than consecutive as required in some schemes here. However, the biggest
problem is pad bytes: the initialization code will skip those, but the
hardware doesn't know they are meaningless don't-care positions and will
busily load the underlying line and merge in the new data.
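
A concrete (made-up) example of the pad-byte problem, assuming a typical
LP64 target:

/* The compiler inserts 3 pad bytes after 'grade' and 4 after 'hours' to
   align the doubles, so field-by-field initialization never touches
   those 7 bytes and the hardware cannot know the line is "fully" written. */
struct payroll_record {
    int    employee_id;    /* offset  0 */
    char   grade;          /* offset  4, then 3 pad bytes */
    double salary;         /* offset  8 */
    int    hours;          /* offset 16, then 4 pad bytes */
    double overtime_rate;  /* offset 24 */
};

void init_record(struct payroll_record *r, int id)
{
    r->employee_id   = id;    /* explicit stores skip the pad bytes...  */
    r->grade         = 'A';
    r->salary        = 0.0;
    r->hours         = 0;
    r->overtime_rate = 0.0;   /* ...so the line is never fully covered  */
}
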

Mill's valid-bits-in-the-caches scheme solves the problem somewhat: it handles
random memory order as well as consecutive; doesn't care if the init is
a loop or random code; lets the app use the data of a partially filled
line without forcing a load from DRAM; and completely eliminates store
buffers and all the associated hardware. As most transient data never
leaves the caches in Mill, the line can tolerate uninitialized pad
bytes. It can also tolerate holes in newly allocated memory, because
Mill virtual-zero will cause the holes to be zero without needing a
read, or page zero-init for that matter, while Mill implicit-zero does
the same for the stack.

However, if a line with pad bytes (or other holes in the data that are
not on top of virtual/implicit zeroes) ever has to be evicted then the
hardware does have to do a load/merge sequence. We plan to upstream LLVM
patches so the compiler will expose the padding to the code generator, so
that initializing the pads can be explicit, saving the load/merge at the
cost of some extra store ops. We won't do the upstream until after a lot
of measurement that we haven't done yet.

Stephen Fuld

unread,
Mar 9, 2020, 2:05:43 AM3/9/20
to
On 3/8/2020 4:29 PM, Ivan Godard wrote:

snip

> A lot of the discussion on this board is oriented toward math-like
> operations on numeric arrays, to the neglect of the non-math computation
> of data processing. IMO, ISAs should pay more attention to the needs of
> apps whose major expensive data structure is an array of payroll
> records. Not everything is linear algebra!


I absolutely agree. But I think that part of the problem is that the
code for common linear algebra operations is well known, whereas the
code for less numerical algorithms that are important for commercial
data processing, such as the inner loop of a database search, tends to
be proprietary, or at least varies a lot among different
implementations. This makes it harder to figure out common things to
optimize in the hardware. Though I do take your point about, and
applaud, several Mill features you describe.

Also, note that Mitch's new enhancements, the original subject of this
thread, are aimed at things like string copy, not linear algebra. That is
a good thing. :-)

MitchAlsup

unread,
Mar 9, 2020, 12:02:20 PM3/9/20
to
Those features also set the stage for automatic translation of linear code
into SIMD codes (or at least multi-lane vector calculations.)

Stefan Monnier

unread,
Mar 11, 2020, 5:25:56 PM3/11/20
to
> I absolutely agree. But I think that part of the problem is that the code
> for common linear algebra operations is well known, whereas the code for
> less numerical algorithms that are important for commercial data processing,
> such as the inner loop of a database search, tends to be proprietary, or at
> least varies a lot among different implementations. This makes it harder to
> figure out common things to optimize in the hardware.

There's still a fair bit of Free Software available that can be used as
benchmark in this respect (i.e. for non-numerical computations), such as
gcc or mysql.


Stefan

Stephen Fuld

unread,
Mar 11, 2020, 6:32:56 PM3/11/20
to
Agreed. But I don't think optimizing for the performance of a compiler
is a huge win, unless you can abstract out some commonly used
functionality. Mitch mentioned symbol table stuff, and perhaps that works.

As for MySQL, that is the kind of application that uses lots of cycles,
and perhaps the inner loop of the Select statement processing, i.e.
searching, could benefit. I just don't know how generic the code used
in MySQL is; that is, would a vector enhancement that works for MySQL
also work for, say, Oracle or DB2?

Stephen Fuld

unread,
Mar 11, 2020, 8:12:31 PM3/11/20
to
I don't really understand what you mean here, but I look forward to
finding out as you make progress on it. :-)

Terje Mathisen

unread,
Mar 12, 2020, 5:09:53 PM3/12/20
to
All I have ever heard about SQL DBs is that the typical main loop is
_really_ large, enough to blow out more or less any code cache?

OTOH, there are implementations which claim significantly higher
performance; it is tempting to guess they are doing similar stuff to
graphics pipelines, i.e., a small internal JITer which compiles the
current statement into something which is a lot more compact.

Stephen Fuld

unread,
Mar 13, 2020, 2:41:00 PM3/13/20
to
I haven't looked at any code, but that certainly seems plausible. Which
means that VVM is probably not applicable here.

One non-linear-algebra, commercially oriented thing where VVM might help
is AES encryption. I know Intel at least has instructions to aid this,
but I don't know if they are so slow that the memory streaming feature
Mitch just disclosed wouldn't help.

But even if they aren't helped by VVM, their existence is an example of
something in an ISA designed especially for something besides linear
algebra.

MitchAlsup

unread,
Mar 13, 2020, 4:29:24 PM3/13/20
to
The question I have, here, is whether the "main loop" is the innermost loop?
a) VVM is designed for innermost loops only
b) VVM is restricted to loops smaller than 32 instructions in length

(b) is part of allowing reasonable VVM implementations on small in order
design points without restricting the performance of GBOoO implementations.
>
> One non linear algebra, commercially oriented thing where VVM might help
> is AES encryption. I know Intel at least has instructions to aid this,
> but I don't know if they are so slow that the memory streaming feature
> Mitch just disclosed wouldn't help.
>
> But even if they aren't helped by VVM, their existence is an example of
> something in an ISA designed especially for something besides linear
> algebra.

VVM can do Gather/Scatter by having one load produce the required data
for a subsequent load (Gather) or store (Scatter).

VVM can vectorize loops with If-Then-Else using predication

Vector loops in VVM have access to the ENTIRE instruction set of My 66000
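
(In C terms, the loop shapes meant here are roughly the following; the
names are illustrative and the My 66000 encoding is not shown.)

#include <stddef.h>

/* Gather: one load produces the index consumed by the next load. */
void gather(double *y, const double *table, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = table[idx[i]];
}

/* Scatter: one load produces the index consumed by the store. */
void scatter(double *out, const double *x, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[idx[i]] = x[i];
}

/* If-Then-Else in the loop body, handled with predication under VVM. */
void abs_each(double *z, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = (x[i] < 0.0) ? -x[i] : x[i];
}
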

Stephen Fuld

unread,
Mar 21, 2020, 5:53:45 PM3/21/20
to
On 2/28/2020 2:52 PM, MitchAlsup wrote:
> As I have discussed here before, I have invented a means to process
> instructions quickly on not-very-wide machines called the "Virtual Vector
> Method". Today's topic concerns some "other" stuff one may want to do when
> processing Vector Loops that would be unavailable when one is processing
> scalar instructions (even several at a time.)
>
> Vector Processing is KNOWN to do "bad" tings to data caches, generally strip
> mining one or more vectors through the cache, leaving no footprint of other
> data which may be of higher importance after the vector calculations have
> completed. Today's topic should alleviate most of this degradation.
>
> Vector loops come in two (2) flavors: counted loops, and conditional loops.
> Counted loops are of the classical form:
>
> for( i = 0; i < MAX; i++ )
>
> Whereas Conditional loops are of the form:
>
> for( i = 0; p[i]; i++ )
>
> In the counted loop version, it would be easy for the processor to examine
> the MAX variable at Loop-setup time and determine if the loop was "long"
> (say more trips than "page" size) or not "long". Should the loop be long,
> Loads and Stores can be processed with modified semantics.
>
> In particular, Loads which take a miss in the data cache can be brought into
> some memory buffer and forwarded into calculation from there without ever
> being written into the data cache and without incurring any cycles of additional latency that would not have already occurred (and probably
> saving some cycles.)
>
> Similarly, Stores which take a miss in the Data Cache, where it is KNOWN
> that every byte of he cache line will be written, can collect their
> data and Write-Invalidate it beyond the Data Cache without incurring any
> cycles of additional latency; indeed, this eliminates the requirement
> that Stores which miss the Data Cache Read the data into the cache and
> obtain Write-Permission (E or M state) prior to storing new data.
>
> Now, taken together; consider the following loop::
>
> {
> char from[*], to[*];
> for( i = 0; i < count; i++ )
> to[i] = from[i];
> }
>
> A) this is a counted loop
> B) in this case, to[*] and From[*] are page aligned data
> C) VVM is implemented.
>
> Hardware notices that the Loaded data is forwarded to Store Data without
> ever being used, and thus, the from[*] data is not needed in the Data Cache.


Let me complicate this a bit. Suppose you are doing a move with
translate. That inserts a load, indexed by the first load's value,
between the load and the store.

So questions.

I presume you can't "coalesce" the loads as you do on the straight byte
move, but can you fill up the load buffer on the first load so the loads
for the subsequent bytes get loaded from it rather than the L1 cache?

I am guessing that the second loads (the translate ones) have to be
handled as individual bytes, but does the VVM hardware automatically do
this on multiple FUs if they are available?

Do the stores get "coalesced" so only one store, of the whole half cache
line, gets done?


Note: I hope you don't mind that I may be "pushing" you beyond where
you want to go.

MitchAlsup

unread,
Mar 22, 2020, 1:06:53 PM3/22/20
to
Yes, the HW notices the first LD is a dense byte-sized walk over a cache
line. The second load is not a dense walk. The third store is another
dense walk.
>
> I am guessing that the second loads (the translate ones) have to be
> handled as individual bytes, but does the VVM hardware automatically do
> this on multiple FUs if they are available?

In the implementation I am envisioning, there are a number of buffers
(not so different from an L0 cache); on a low end machine the number might
be 4 buffers. The 1st LD takes one buffer, the 3rd ST takes another, leaving
2 buffers of 64 bytes each. If the translation pattern falls primarily
such that 128 bytes suffice to translate the LDs into the STs, then there
will be no additional delay from accessing the memory hierarchy
(otherwise there will be).

This covers the case of "printable" ASCII but not of the chars with
16-bit representations.

How many bytes are translated per cycle is probably 1 on the lower
end model I am more actively pursuing. So::

MOV R5,#0                 // i = 0
loop:
VEC 0B10000
LDUB R6,[R1+R5]           // load source byte: R6 = src[i]
LDUB R7,[R2+R6]           // translate: R7 = table[R6]
STB R7,[R3+R5]            // store: dst[i] = R7
LOOPLT R5,#1,R4           // ADD-CMP-BLT: i += 1; loop while i < count

This vectorized loop should run at 2-cycles per loop or the equivalent of
3 IPC.
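
(For reference, the scalar C source this corresponds to; the register
mapping R1=src, R2=table, R3=dst, R4=count, R5=i is my reading of the
code above.)

#include <stddef.h>
#include <stdint.h>

/* Move-with-translate: each source byte indexes a 256-entry table. */
void translate_copy(uint8_t *dst, const uint8_t *src,
                    const uint8_t table[256], size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = table[src[i]];
}
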
>
> Do the stores get "coalesced" so only one store, of the whole half cache
> line, gets done?

The HW recognizes the dense walk over a cache line and places the
translated bytes in the St's buffer and when the buffer is filled, it is
written into the L1 cache in one activity. Should the L1 be 1/2 line
at a time, this will take 2 back-to-back cycles.

Terje Mathisen

unread,
Mar 22, 2020, 1:31:55 PM3/22/20
to
MitchAlsup wrote:
> On Saturday, March 21, 2020 at 4:53:45 PM UTC-5, Stephen Fuld wrote:
>> Let me complicate this a bit. Suppose you are doing a move with
>> translate. That inserts a load, indexed by the first load's value
>> between the load and the store.
>>
>> So questions.
>>
>> I presume you can't "coalesce" the loads as you do on the straight byte
>> move, but can you fill up the load buffer on the first load so the loads
>> for the subsequent bytes get loaded from it rather than the L1 cache?
>
> Yes, the HW notices the first LD is a dense byte-sized walk over a cache
> line. The second load is not a dense walk. The third store is another
> dense walk.
>>
>> I am guessing that the second loads (the translate ones) have to be
>> handled as individual bytes, but does the VVM hardware automatically do
>> this on multiple FUs if they are available?
>
> In the implementation I am envisioning, there are a number of buffers
> (not so different from an L0 cache) on a low end machine the number might
> be 4 buffers. 1st Ld takes one buffer, 3rd St takes another, leaving
> 2 buffers of 64-bytes each. If the translation pattern primarily falls
> such that 128 bytes will translate the LDs into the STs, then there
> will be no additional delay from accessing the memory hierarchy
> (otherwise there will).
>
> This covers the case of "printable" ASCII but not of the chars with
> 16-bit representations.

To me this sounds like an opportunity!

If I had an 8kB table, then I would consider doing an initial set of
load/prefetch operations, one per cache line, to get this table into
$L1, then I could use the streaming buffers on the input and output arrays.
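
Something along these lines, say, using the GCC/Clang prefetch builtin
(the 64-byte line size is an assumption):

/* Touch each line of an 8 KiB translate table once, up front, so the
   table sits in L1 while src/dst stream through the side buffers. */
static void preload_table(const unsigned char *table /* 8192 bytes */)
{
    for (int i = 0; i < 8192; i += 64)
        __builtin_prefetch(table + i, 0 /* read */, 3 /* keep in L1 */);
}
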

Stephen Fuld

unread,
Mar 24, 2020, 3:44:40 PM3/24/20
to
Neat. This is better than what I interpreted your original post to say.
I thought the loads would only be coalesced if the data was sent to
the store unit directly, but now I understand that each load (or store)
is evaluated for "dense walk" independently of what happens to the
loaded data after it is loaded.

MitchAlsup

unread,
Mar 24, 2020, 5:34:23 PM3/24/20
to
Another way to think about the method is that the scalar registers are
used during vector setup to control the data-flow dependencies, but
afterwards there are no registers, just results and landing zones (for
results).

A processor can devote as many or as few resources as fit its target
design cost/application/price/.....

MitchAlsup

unread,
Mar 25, 2020, 3:45:57 PM3/25/20
to
A follow up thought::

The first load understands that the associated buffer can supply data for
the next K Loops and that this (dense) load only has to run a few times
every 2**k loop iterations. So does the third store.

So, in effect the first load of a cache line satisfies a large number
of future load instructions which will be "satisfied" from the data
already present, and under the current Cache Tag and TLB states. Thus,
these need not be re-run until the line boundary gets crossed.

The second load gets no such understanding and has to be "performed"
once each loop iteration.

So, in effect, by separating the "AGEN" part from the "get the data" part,
we "perform" the semantics required without actually executing all of the
instruction. We still get to count the instruction as having been executed,
even when we short-circuit some steps along the way.
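
A toy C model of that separation (not the real pipeline; names are
illustrative): AGEN happens every iteration, but the "get the data" step
only runs when the address crosses into a new line.

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64

struct stream_buf {
    uint64_t line_addr;           /* which line the buffer holds */
    uint8_t  bytes[LINE_BYTES];
    bool     valid;
};

static uint8_t load_byte(struct stream_buf *b, uint64_t addr,
                         void (*refill)(struct stream_buf *, uint64_t))
{
    uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1);
    if (!b->valid || b->line_addr != line) {
        refill(b, line);          /* only here are cache tags/TLB consulted */
        b->line_addr = line;
        b->valid = true;
    }
    return b->bytes[addr & (LINE_BYTES - 1)];
}
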

Stephen Fuld

unread,
Mar 25, 2020, 4:59:01 PM3/25/20
to
Yup, I get it. A nice "fallout" of this is that it is load-size
independent. That is, it should "automagically" work for, say, adding 1
to each element of a 16-bit-per-element array, or adding two single
precision floating point arrays.
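
For concreteness, those loops in C would look like (illustrative names):

#include <stddef.h>
#include <stdint.h>

/* Add 1 to each element of a 16-bit array. */
void add1_u16(uint16_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += 1;
}

/* Element-wise sum of two single-precision arrays. */
void add_f32(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
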

You said that in my translate example, you would probably only do one
"translate load" per cycle. Is that a restriction based on the number
of load/store units, or something else? Or to put it another way, in the
add-one-to-each-element-of-an-integer-array example, would more than one
element be added in each cycle?

MitchAlsup

unread,
Mar 25, 2020, 5:55:34 PM3/25/20
to
Another fallout is that the memory traffic does not have to be "every byte"
dense; one could skip every other byte and still recognize that the next
several loads have their data already available (without even looking).
>
> You said that in my translate example, you would probably only do one
> "translate load" per cycle. Is that a restriction based on the number
> of load store units or something else. Or to put it another way, in the
> add one to each element of an integer array example, would there be more
> than one element added to in each cycle?

The limitation has to do with the scale of the machine I envision--
which is a lot closer to a 1.3-wide machine than a 2-wide SuperScalar
machine. A bigger machine would have little difficulty applying more
resources to make this run faster.

Quadibloc

unread,
Mar 27, 2020, 2:13:31 AM3/27/20
to
I am glad that you are working on something to help improve the processing of
vectors.

My own understanding of these technical questions is limited, however. It seems to
me that if the issue is that vector calculations needlessly disturb the cache, a
solution would be - in old-style technology, at least - to put the vector
calculations in a co-processor separate from the scalar processor, not using its
cache.

My own interest in vectors comes from a different direction.

As is well-known, a typical GPU can process many more floating-point operations
per second than a conventional CPU. This has led to GPU computing.

But GPUs have a highly specialized design. So even if a computational problem is
highly vectorizable, if what you're using to solve it is a CPU-GPU hybrid
system, the amount that can be sped up using the GPU is limited.

The NEC SX-Aurora TSUBASA is, as far as I know, the last surviving example,
among systems directed at high-performance computing, of a computer the
architecture of which is based on the concepts employed so successfully in the
Cray I and its successors.

It can't do _quite_ as many 64-bit FLOPS as a high-end GPU card, but because
it's not a GPU, but a vector computer, it is much more flexible. This would seem
to mean that it could speed up a larger proportion of a scientific problem,
making it a "sweet spot" for scientific HPC.

It achieves high memory bandwidth by packing 48 gigabytes of memory in the same
module as the processor using HBM. The module is a card that doesn't look much
different from a video card, so a host x86 system handles I/O for the vector
processor.

So I wonder why this system isn't getting more recognition and wider adoption.
If, as it seems to me _should_ be the case, it runs real-world scientific
problems more effectively than any other kind of system available, surely it
should gain wide recognition.

One possible cause is that it's a Japanese product, and some sources of funding
for supercomputers will include a condition of developing the domestic
supercomputer industry. Given that the patents on the CRAY-I have expired,
though, and indeed back in the day IBM, Univac, and DEC all made vector add-ons
for their big computers of a similar design, why wouldn't some American company
get TSMC to make vector processors for it?

However, while 48 gigabytes of RAM is a big memory for a home PC, a
supercomputer working on large problems needs access to more. The NEC systems
would consist of numerous x86 servers, each with several vector cards, in a
high-speed network. But does that really let them work together in an effective
way? Maybe the disadvantages of using GPU cards instead would be offset if, say,
CRAY's networking tech were better.

Basically, after that first 48 gigabytes of RAM, it seems you might fall off a
cliff, because then you're depending on the x86 server to supply you with
everything outside that - in software. If the vector processors could directly
access the network - or even just the other vector processors in the same server
(like those connectors used to connect two video cards together, so each one can
handle half the screen) - one potential weakness might be addressed.

Also, in my naivete, since after the 360/195, the next big step in computing in
the old days was the Cray-I, and microcomputers have retraced the evolution of
discrete component and small-scale IC systems from small minicomputers like the
PDP-8 up to the 360/195, it seems like the only remaining _known_ architectural
step already proven in practice for higher-performance systems is to follow the
trail blazed by the Cray-I and its successors.

No doubt my reasoning could be flawed.

But it should be clear from this why it seems apparent to me that a classic vector computer provides an architectural model for increasing the potential maximum performance available from computers.

Indeed, given that inexpensive computer chips often include small-scale GPUs on
them to allow making systems that can play games without buying a separate video
card, and the EPYC server chips from AMD have enough pins to support eight
channel memory (the SX-9 connected sixteen DRAM modules in parallel to its
processors), it _seems_ to me that the technology exists to make even a mass-
market product of this type (although with less performance than the NEC
system).

Given what seems to be an insatiable appetite for more computer power out there,
I can only wonder why it isn't happening.

John Savard

Terje Mathisen

unread,
Mar 27, 2020, 2:51:16 AM3/27/20
to
Quadibloc wrote:
> I am glad that you are working on something to help improve the processing of
> vectors.
>
> My own understanding of these technical questions is limited, however. It seems to
> me that if the issue is that vector calculations needlessly disturb the cache, a
> solution would be - in old-style technology, at least - to put the vector
> calculations in a co-processor separate from the scalar processor, not using its
> cache.
>
> My own interest in vectors comes from a different direction.
>
> As is well-known, a typical GPU can process many more floating-point operations
> per second than a conventional CPU. This has led to GPU computing.
>
> But GPUs have a highly specialized design. So even if a computational problem is
> highly vectorizable, if what you're using to solve it is a CPU-GPU hybrid
> system, the amount that can be sped up using the GPU is limited.
>
> The NEC SX-Aurora TSUBASA is, as far as I know, the last surviving example,
> among systems directed at high-performance computing, of a computer the
> architecture of which is based on the concepts employed so successfully in the
> Cray I and its successors.
>
> It can't do _quite_ as many 64-bit FLOPS as a high-end GPU card, but because
> it's not a GPU, but a vector computer, it is much more flexible. This would seem
> to mean that it could speed up a larger proportion of a scientific problem,
> making it a "sweet spot" for scientific HPC.

Over the last 10+ years, the most FP bang for your buck you could get in
a regularly programmable cpu might have been the Larrabee descendants,
but they are still targeting 32-bit more than 64.

Ivan Godard

unread,
Mar 27, 2020, 11:13:27 AM3/27/20
to
Fancy that. It's called "LOAD" and "PICKUP". Be aware that there are
complications when you have loaded more than (it turns out) you will
pick up; reclaiming the buffer and all.