Ultrasparc block load/store -- what is optimal?

Chris Torek
Apr 13, 2000
Background: Ultrasparc has block load/store operations (as part of
the VIS instruction set) that load or store 64 bytes into the FPU
register file. These must be on a 64-byte aligned address, both
in memory and in the register file. There is a separate mechanism
for doing byte realignment. Both the load and store avoid allocating
in the caches (L1 and L2), which of course is important to avoid
wiping them out when doing large copies.

There are severe constraints on references to the target registers.
I find the wording in the Ultrasparc User's Manuals confusing:

To simplify the implementation, BLD destination registers may
or may not interlock like ordinary load instructions. Before
referencing the block load data, a second BLD (to a different
set of registers) or a MEMBAR #Sync must be performed.

So far, so good -- if you want to use block load, you need:

BLD [addr_zorch], reg_blk_zorch
BLD [addr_blogg], reg_blk_blogg

to occur in that order (since -- presumably -- a full MEMBAR will
kill performance). But read the next lines:

If a second BLD is used to synchronize with returning data,
then Ultrasparc-IIi [nb: I am using the IIi book here] continues
execution before all data has been returned. The lowest number
register being loaded may be referenced in the first instruction
group following the second BLD, the second lowest number register
may be referenced in the second group, and so on. If this rule
is violated, data from before or after the load may be returned.

The intent is clear enough: BLD simply starts streaming the data
in, and if there are any ready-to-read interlocks on the registers,
they are stepped along by "instruction groups". The first problem
is: "the lowest number register" is, in effect, a dangling reference.
Does this mean "the lowest number register of the *first* BLD" --
set "zorch", above -- or does it mean "the lowest number register
of the *second* BLD"? In other words, given the above two BLDs, are
we allowed to:

FMOVD third_reg_in_zorch, safe_place

or are we limited only to "first_reg_in_zorch"?

Given the way I suspect the hardware works inside, the latter would
make little sense, so I have presumed that they mean the former.

The wording for the constraints on the registers of a block store
is unambiguous (but slightly different: you can *overwrite*, not
just "refer to", the first register, in the first group following
the BST).

Ignoring all of these constraints for a moment, the "obvious" way to
copy a large block of memory is with a sequence of alternating loads
and stores:

loop for n := 0 to nblks - 1 do
    BLD [src], regs
    BST regs, [dst]
    src += 64
    dst += 64
end loop

Clearly, however, there is an alternative that might perform better:

BLD [src], regs0        ! load block 0
src += 64               ! from offset 0

BLD [src], regs1        ! load block 1
src += 64               ! from offset 64

! assuming nblks is even, and ignoring boundary problems and
! the referencing constraints ("lowest num reg in first group" etc):

loop for n := 2 step 2 to nblks - 1 do
    BST regs0, [dst]    ! now that block n-2 has arrived,
    dst += 64           ! store it

    BLD [src], regs0    ! begin streaming in block n
    src += 64

    BST regs1, [dst]    ! now that block n-1 has arrived,
    dst += 64           ! store it

    BLD [src], regs1    ! begin streaming in block n+1
    src += 64
end loop

! at this point, blocks n-2 and n-1 are still sitting in
! regs0 and regs1 waiting to be stored

BST regs0, [dst]        ! block n-2 arrived, so store it
dst += 64

! Bug: regs1 requires a MEMBAR or a "dummy" BLD
BST regs1, [dst]        ! and store block n-1
dst += 64               ! for symmetry
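The structure of that pipelined loop can be sketched in portable C, with memcpy() of 64-byte chunks standing in for BLD/BST. This is purely illustrative: the names blkcopy2, regs0, regs1 are invented for the sketch, and a C compiler gives none of the cache-bypass or load/store overlap that is the whole point of the real instructions.

```c
#include <string.h>

/*
 * Sketch of the two-deep pipelined copy: regs0/regs1 stand in for
 * the two FP register blocks, memcpy() for BLD/BST.  nblks must be
 * even and at least 2, as in the pseudocode above.
 */
static void blkcopy2(unsigned char *dst, const unsigned char *src, int nblks)
{
    unsigned char regs0[64], regs1[64];
    int n;

    memcpy(regs0, src, 64); src += 64;          /* BLD block 0 */
    memcpy(regs1, src, 64); src += 64;          /* BLD block 1 */

    for (n = 2; n < nblks; n += 2) {
        memcpy(dst, regs0, 64); dst += 64;      /* BST block n-2 */
        memcpy(regs0, src, 64); src += 64;      /* BLD block n   */
        memcpy(dst, regs1, 64); dst += 64;      /* BST block n-1 */
        memcpy(regs1, src, 64); src += 64;      /* BLD block n+1 */
    }

    memcpy(dst, regs0, 64); dst += 64;          /* final two blocks */
    memcpy(dst, regs1, 64);
}
```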

This loop matches (to a large extent) the code example in the User's
Manual, except that the example also uses "faligndata", which neatly
fills in the needed instruction groups.

Now, we could just stop here and say, voila, one load streams into
one register-block while one store streams out of the other. But
then I found this white paper on a Sun web site:

http://www.sun.com/microelectronics/whitepapers/wpr-0001/index.html

which contains the following:


UltraSPARC-I was limited to a maximum of one coherent read and
one dirty-victim writeback request outstanding to the memory
subsystem. ... UltraSPARC-II addresses this problem
by supporting multiple outstanding requests. When a read request
is sent out to the memory subsystem, the system does not block
or stop executing if a second request has to be sent out.
Instead it can continue to execute with up to three read requests
and two dirty victim writeback requests outstanding to the
memory subsystem. A fourth read or a third writeback will cause
the processor to stop execution.

If this applies to block load, then instead of two BLDs, the loop
should be primed with three:

! first set up three outstanding loads
BLD [src], regs0; src += 64
BLD [src], regs1; src += 64
BLD [src], regs2; src += 64

! then start storing-and-loading in a loop
loop for n := 3 step 3 to nblks - 1 do
    BST regs0, [dst]; dst += 64
    BLD [src], regs0; src += 64

    BST regs1, [dst]; dst += 64
    BLD [src], regs1; src += 64

    BST regs2, [dst]; dst += 64
    BLD [src], regs2; src += 64
end loop

! and then do the cleanup (omitted)

One of my questions, then, is: do we want the three load version,
or the two load version? Does the answer change for different
Ultrasparc CPUs? If we use the three-load version, does that
eliminate the need to put "dummy" references to regs0[0] through
regs0[7], regs1[0]..regs1[7], etc., inside the loop in the "fully
aligned" cases (e.g., when copying a page-aligned 1MB buffer)?

Ignoring that question for now, let us look at what else goes on
in (at least some) actual copies. We need to consider how the
FALIGNDATA instruction is used to handle "bad" alignments. Suppose
the destination address is 64-byte aligned, but the source address
is 63 mod 64, i.e., we want to copy bytes as follows (where the
bytes are numbered in hexadecimal):

          src                             dst
+-----------------------+       +-----------------------+
|  |  |  |  |  |  |  |  |       |00|01|02|03|04|05|06|07|
+-----------------------+       +-----------------------+
|  |  |  |  |  |  |  |  |       |08|09|0a|0b|0c|0d|0e|0f|
+-----------------------+       +-----------------------+
|  |  |  |  |  |  |  |  |       |..|..|..|..|..|..|..|..|
+-----------------------+       +-----------------------+
|  |  |  |  |  |  |  |00|       |38|39|3a|3b|3c|3d|3e|3f|
+=======================+       +-----------------------+
|01|02|03|04|05|06|07|08| <- next block
+-----------------------+
|..|..|..|..|..|..|..|..|
+-----------------------+
|39|3a|3b|3c|3d|3e|3f|  |
+-----------------------+

The FALIGNDATA instruction takes two 8-byte FP registers, combines
them to form a 16-byte string, and selects 8 of those 16 bytes to
write into a third 8-byte FP register. (The bytes are chosen based
on another register, which we can ignore.) In this case, then, we
want to do this to copy exactly one block:

! Call regs0[0] the first 8 bytes, regs0[1] the 2nd, etc.
! Load up the two 64-byte blocks that make up the source bytes.
src -= 63 ! get it aligned
BLD [src], regs0 ! actually we only need regs0[7], but...
BLD [src + 64], regs1

MEMBAR ! for now -- not for performance

! combine last byte from block 0 + 7 bytes from block 1
FALIGNDATA regs0[7], regs1[0], regs2[0]

! and repeat with each set of bytes from block 1
FALIGNDATA regs1[0], regs1[1], regs2[1]
FALIGNDATA regs1[1], regs1[2], regs2[2]
FALIGNDATA regs1[2], regs1[3], regs2[3]
FALIGNDATA regs1[3], regs1[4], regs2[4]
FALIGNDATA regs1[4], regs1[5], regs2[5]
FALIGNDATA regs1[5], regs1[6], regs2[6]
FALIGNDATA regs1[6], regs1[7], regs2[7]

! now we have them all, so store them
BST regs2, [dst]
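For what it is worth, the byte selection FALIGNDATA performs can be modeled in C as follows. This is a sketch only: registers are modeled as 8-byte arrays in SPARC's big-endian byte order, and on real hardware the offset comes from the GSR, which ALIGNADDR sets from the low bits of the source address.

```c
#include <stdint.h>
#include <string.h>

/*
 * Model of FALIGNDATA: concatenate rs1 (first) and rs2 (second) into
 * a 16-byte string, then copy out the 8 bytes starting at offset
 * 'off' (0..7; on real hardware this comes from the GSR).
 */
static void faligndata(const uint8_t rs1[8], const uint8_t rs2[8],
                       int off, uint8_t rd[8])
{
    uint8_t cat[16];

    memcpy(cat, rs1, 8);
    memcpy(cat + 8, rs2, 8);
    memcpy(rd, cat + off, 8);           /* select 8 of the 16 bytes */
}
```

With off == 7 this produces the last byte of rs1 followed by the first seven bytes of rs2 -- exactly the regs0[7]/regs1[0] combination the offset-63 case above needs.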

If the offset is less extreme -- say, 32 to 39 instead of 63 --
we need more data from the first 64-byte block and less from
the second, and we need a different set of FALIGNDATA instructions:

! (this is actually just "src &= ~63", but we have to
! select one of eight copy loops, so this step happens
! elsewhere -- I include it here purely for illustration)
src -= 32 or 33 or 34 or ... or 39 ! get it aligned
BLD [src], regs0
BLD [src + 64], regs1

MEMBAR ! again, just for now

! Combine bytes from block 0, row 4 with bytes from block 0, row 5.
! Block 0 row 7 combines with row 0 block 1, etc.
FALIGNDATA regs0[4], regs0[5], regs2[0]
FALIGNDATA regs0[5], regs0[6], regs2[1]
FALIGNDATA regs0[6], regs0[7], regs2[2]
FALIGNDATA regs0[7], regs1[0], regs2[3]
FALIGNDATA regs1[0], regs1[1], regs2[4]
FALIGNDATA regs1[1], regs1[2], regs2[5]
FALIGNDATA regs1[2], regs1[3], regs2[6]
FALIGNDATA regs1[3], regs1[4], regs2[7]

BST regs2, [dst]

Now let us put this together with the load/ref ordering constraints,
along with the idea that there are many more blocks to copy, so that
as soon as we are done with some register(s), we should use BLD to
load new values.

Remember that I am assuming that the second BLD causes only the
*first* BLD's registers to become reference-able, and then only
regs0[0] can be referenced in the first FALIGNDATA. We need to
access regs0[7] and regs1[0]! To guarantee that we can do that,
we have to prime the whole process with "just enough" FMOVD
instructions replacing the MEMBAR. Here I assume the source offset
is between 24 and 31 bytes (so the first FALIGNDATA pairs regs0[3]
with regs0[4]):

BLD [src], regs0; src += 64
BLD [src], regs1; src += 64

FMOVD regs0[0], temp ! wait four FP instruction groups
FMOVD regs0[1], temp ! (note: the source and dest of the
FMOVD regs0[2], temp ! fmovd instructions are irrelevant,
FMOVD regs0[3], temp ! but this is easier to read)

loop for n := 2 step 2 to nblks - 1 do
    FALIGNDATA regs0[3], regs0[4], regs2[0]
    FALIGNDATA regs0[4], regs0[5], regs2[1]
    FALIGNDATA regs0[5], regs0[6], regs2[2]
    FALIGNDATA regs0[6], regs0[7], regs2[3]

    ! Start a new block coming in.  Must save regs0[7] first!
    FMOVD regs0[7], temp
    BLD [src], regs0; src += 64

    ! Now we are allowed to refer to regs1[0] here:
    FALIGNDATA temp, regs1[0], regs2[4]
    FALIGNDATA regs1[0], regs1[1], regs2[5]
    FALIGNDATA regs1[1], regs1[2], regs2[6]
    FALIGNDATA regs1[2], regs1[3], regs2[7]

    ! finally assembled an output block -- start it storing
    BST regs2, [dst]; dst += 64

    ! Here we are allowed to overwrite regs2[0]
    FALIGNDATA regs1[3], regs1[4], regs2[0]
    FALIGNDATA regs1[4], regs1[5], regs2[1]
    FALIGNDATA regs1[5], regs1[6], regs2[2]
    FALIGNDATA regs1[6], regs1[7], regs2[3]
    FMOVD regs1[7], temp        ! have to save it

    ! begin preloading yet another block
    BLD [src], regs1; src += 64

    ! Now we are allowed to refer to regs0[0]
    FALIGNDATA temp, regs0[0], regs2[4]
    FALIGNDATA regs0[0], regs0[1], regs2[5]
    FALIGNDATA regs0[1], regs0[2], regs2[6]
    FALIGNDATA regs0[2], regs0[3], regs2[7]

    BST regs2, [dst]; dst += 64
end loop

Now, all this is assuming we want to have one block load "in flight"
while we execute fp-group instructions that wait-for-and-read-from
the previous block load. If it is better to have two in flight --
well, that is pretty much the same question I asked earlier.

There is another (trivial) problem remaining. When the loop
finishes, the two register sets (the "arrays" regs0[] and regs1[],
above) are still being loaded and still need to be stored. The
required FALIGNDATA instructions suffice to finish waiting for one
of the blocks, but the other one requires either a "dummy" BLD --
so that we can then do more FALIGNDATAs that meet the "can only
reference registers in block zorch" constraint -- or a MEMBAR.
Is it "better" (more efficient) to do a "wasted" extra BLD, just
to enable the previous block to synchronize, or is it better to
use a MEMBAR, which grinds everything to a halt? The MEMBAR avoids
the useless bus transaction, but prevents the last block-store from
overlapping nicely.

(Since this is purely mop-up, it does not really make that much
difference.)

Another question: since BLD/BST require using the FP register file,
and the block copy code must be re-entrant, the code has to save
any registers it uses before it begins, then restore them when it
is done. (Imagine the disaster if a block copy were interrupted,
and the interruptee did another block copy, without first saving
the FP registers. This is quite rare in practice but does occur
now and then.) The registers can be stored to the stack, either
with a series of STDF instructions or -- by allocating extra stack
space, aligning on a 64-byte boundary, and using BST and BLD to
write/read them. Those require extra MEMBARs, so: is block-transferring
these worthwhile? In any case, the store-and-load creates extra
memory traffic, so: at what point is the cost of this extra traffic
amortized by the improved copy rate and by the lack of cache
pollution?

Note: I wrote the following code that uses just the integer registers
to do arbitrarily-aligned memory copies using minimal loads and
stores. Timing it, vs the VIS instructions, on 64-byte aligned
copies (and using only two on-the-fly register sets rather than
three, in the VIS code), shows that the VIS instructions are really
only very slightly faster, and then only for large blocks, presumably
due to the cost of saving and restoring the FPU registers (which
I have been doing with regular stores rather than block-store, for
my timings). So I assume the real gain from using BLD/BST to do
block copies, at least for sizes in the 1024 to 65536 byte range or
so, is that it does not overwhelm the rest of the cache.

[%i0 is source, %i1 is destination, %i2 is len]
[these use lda/sta because they are designed for kernel copyin/out as
well as kernel bcopy]
/*
 * First check for tiny copies (<= 8 bytes).  If we are under
 * this limit, just do the copy and go out.
 */
        mov     8, %g1                  ! always 8 -- for subtracting
        and     %i0, 7, %i4             ! soff = src & 7;
        cmp     %i2, 8                  ! if (len > 8)
        bgu     %xcc, L(morethan8)      !       goto morethan8;
        and     %i1, 7, %i5             ! doff = dst & 7;

... some code omitted ...

LABEL(morethan8)
        /*
         * We have more than 8 bytes to copy.  %i4 has soff, %i5 has doff.
         * If doff is nonzero, align dst by copying at most 7.  This will
         * write at most one word (clen = doff, so doff>clen is false).
         */
        brz     %i5, L(dstaligned)
        andn    %i0, 7, %o4             ! base = src & ~7;

        /* load 7 bytes from <src,soff> */
        sub     %g1, %i5, %o3           ! a = 8 - doff;
        /* NB: clen = 8 - (8 - doff) = doff; len = (8 - doff) */

        ldxa    [%o4] SRC_ASI_BYTES, %l0 ! x0 = *base;

        sllx    %i4, 3, %i3             ! lshift = soff * 8;
        cmp     %i4, %i5

        sllx    %l0, %i3, %l0           ! x0 <<= lshift;
        bleu    %xcc, 0f                ! if (soff > /*clen*/doff) {
        inc     8, %o4
        ldxa    [%o4] SRC_ASI_BYTES, %l1 !      x1 = *(base + 8);
        neg     %i3                     !       rshift = -lshift;
        srlx    %l1, %i3, %l1           !       x1 >>= rshift;
        or      %l0, %l1, %l0           !       x0 |= x1;
0:                                      ! }

        /* compute srcmask */
        sllx    %o3, 3, %o1             ! t1 = 8 * a;
        mov     -1, %o0
        srlx    %o0, %o1, %o0           ! srcmask = ~(~0 >> t1);
        not     %o0

        /* store up to 7 bytes -- dst unaligned, but only one word */
        and     %l0, %o0, %l0           ! x0 &= srcmask;
        andn    %i1, 7, %o4             ! base = dst & ~7;
        ldxa    [%o4] DST_ASI_BYTES, %l1 ! x1 = *base;

        sllx    %i5, 3, %i3             ! rshift = doff * 8;

        /* sneak these in here */
        add     %i0, %o3, %i0           ! src += a
        add     %i1, %o3, %i1           ! dst += a
        sub     %i2, %o3, %i2           ! len -= a
        and     %i0, 7, %i4             ! soff = src & 7;

        srlx    %o0, %i3, %o1           ! t1 = srcmask >> rshift;
        andn    %l1, %o1, %l1           ! x1 &= ~t1;
        srlx    %l0, %i3, %o2           ! t2 = x0 >> rshift;
        or      %l1, %o2, %l1           ! x1 |= t2;
        stxa    %l1, [%o4] DST_ASI_BYTES ! *base = x1;

LABEL(dstaligned)
        /*
         * Okay, we now have:
         *
         *      %i0     src (possibly unaligned)
         *      %i1     dst
         *      %i2     len
         *      %i4     soff (src&7)
         *
         * and the destination address is 0 mod 8.  If we still have
         * "lots" to do, use the fancy block ASIs.
         */
#ifdef USE_VIS
        cmp     %i2, 256                ! is this the "right" size?
        bgeu    %xcc, L(fancy)
        nop
#endif

        /*
         * Calculate the number of xwords to move.  Adjust len in advance,
         * and use xword-copying-loops if src is aligned.
         */
        srlx    %i2, 3, %i3             ! %i3 = n_xwords
        brz     %i3, L(mop_up)

        sllx    %i3, 3, %o0             ! %o0 = n_xwords * 8
        brz     %i4, 4f
        sub     %i2, %o0, %i2           ! len -= %o0;

        /*
         * Ugly unaligned source.  We could use our own alignment network,
         * or we could use faligndata.  As it turns out, they should both
         * run at about 5 cycles per iteration without unrolling, and the
         * faligndata trick is a pain to set up, so we will do this for now:
         *
         *      base = src & ~7;
         *      src += n_xwords * 8;
         *      lshift = soff * 8;
         *      x1 = *base;
         *      rshift = -lshift;
         *      base += 8;
         *      do {
         *              x0 = x1 << lshift;
         *              x1 = *base;
         *              base += 8;
         *              x0 |= x1 >> rshift;
         *              *dst = x0;
         *              dst += 8;
         *      } while (--n_xwords != 0);
         */
        andn    %i0, 7, %o4
        add     %i0, %o0, %i0           ! src += %o0
        sllx    %i4, 3, %o3             ! lshift = soff * 8;
        ldxa    [%o4] SRC_ASI_BYTES, %l1 ! x1 = *base;
        inc     8, %o4
        sub     %g0, %o3, %i5           ! rshift = -lshift;
0:                                      ! do {
        sllx    %l1, %o3, %l0           ! x0 = x1 << lshift;
        ldxa    [%o4] SRC_ASI_BYTES, %l1 ! x1 = *base;
        inc     8, %o4                  ! base += 8;
        dec     %i3
        srlx    %l1, %i5, %o1           ! x0 |= x1 >> rshift;
        or      %l0, %o1, %l0
        stxa    %l0, [%i1] DST_ASI_BYTES ! *dst = x0;
        brnz    %i3, 0b                 ! dst += 8;
        inc     8, %i1                  ! } while (--n_xwords);
        b,a     %xcc, L(mop_up)
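The do/while in the comment above can be turned into self-contained C for testing. The ldx/stx helpers here are invented for the sketch: they force big-endian word order so the shift arithmetic matches the sparc code on any host. As in the assembly, soff must be nonzero (the aligned case takes the other loop), and one aligned word past the last source byte is read.

```c
#include <stdint.h>
#include <stddef.h>

/* Big-endian 8-byte load/store, so the shifts below behave as on sparc
 * regardless of the host's byte order. */
static uint64_t ldx(const uint8_t *p)
{
    uint64_t x = 0;
    for (int i = 0; i < 8; i++)
        x = (x << 8) | p[i];
    return x;
}

static void stx(uint8_t *p, uint64_t x)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (uint8_t)x;
        x >>= 8;
    }
}

/*
 * The shift-and-merge loop from the comment above.  base is src & ~7,
 * soff is src & 7 and must be nonzero, n_xwords > 0.  Like the
 * assembly, this reads one aligned word past the last source byte,
 * so the source buffer needs that much slack.
 */
static void copy_unaligned(uint8_t *dst, const uint8_t *base,
                           unsigned soff, size_t n_xwords)
{
    unsigned lshift = soff * 8;
    unsigned rshift = 64 - lshift;      /* "-lshift", as in the asm */
    uint64_t x0, x1 = ldx(base);

    base += 8;
    do {
        x0 = x1 << lshift;
        x1 = ldx(base);
        base += 8;
        x0 |= x1 >> rshift;
        stx(dst, x0);
        dst += 8;
    } while (--n_xwords != 0);
}
```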

4:
        /*
         * Nicely aligned case.  Just use ldxa/stxa.  The Ultrasparc
         * load buffer goes 9 deep, so we should unroll by 8, but
         * that seemed to be slower than unrolling by 4.
         */
        srlx    %i3, 2, %o0             ! n_fours = n_xwords / 4;
        brz,pn  %o0, 1f
        and     %i3, 3, %i3             ! n_xwords &= 3;

        /* copy %o0 four-word chunks */
        rd      %asi, %i5               ! save current %asi
        wr      %g0, DST_ASI_BYTES, %asi
0:
        add     %i0, 8, %l5
        add     %i0, 16, %l6
        ldxa    [%i0] SRC_ASI_BYTES, %l0
        add     %i0, 24, %l7
        deccc   %o0
        ldxa    [%l5] SRC_ASI_BYTES, %l1
        ldxa    [%l6] SRC_ASI_BYTES, %l2
        ldxa    [%l7] SRC_ASI_BYTES, %l3
        stxa    %l0, [%i1] %asi
        stxa    %l1, [%i1 + 8] %asi
        stxa    %l2, [%i1 + 16] %asi
        stxa    %l3, [%i1 + 24] %asi
        inc     32, %i0
        bnz     %xcc, 0b
        inc     32, %i1

        wr      %i5, 0, %asi            ! restore %asi

1:      /* zero through three xwords left to copy. */
        brz     %i3, L(mop_up)
        nop
0:
        ldxa    [%i0] SRC_ASI_BYTES, %l0
        inc     8, %i0
        dec     %i3
        stxa    %l0, [%i1] DST_ASI_BYTES
        brnz    %i3, 0b
        inc     8, %i1

LABEL(mop_up)
        /*
         * Mop up trailing bytes.  Still have %i0 = src, %i1 = dst,
         * %i2 = len, and %i4 = soff, with doff==0.
         */
        brz     %i2, L(done)
        andn    %i0, 7, %o4             ! base = src & ~7;
        sub     %g1, %i2, %o5           ! clen = 8 - len;
        ldxa    [%o4] SRC_ASI_BYTES, %l0 ! x0 = *base;
        sllx    %i4, 3, %i3             ! lshift = soff * 8;
        cmp     %i4, %o5
        sllx    %l0, %i3, %l0           ! x0 <<= lshift;
        bleu    %xcc, 0f                ! if (soff > clen) {
        inc     8, %o4
        ldxa    [%o4] SRC_ASI_BYTES, %l1 !      x1 = *(base + 8);
        neg     %i3                     !       rshift = -lshift;
        srlx    %l1, %i3, %l1           !       x1 >>= rshift;
        or      %l0, %l1, %l0           !       x0 |= x1;
0:                                      ! }
        /* store to aligned boundary */
        sllx    %i2, 3, %o1             ! t1 = 8 * len;
        mov     -1, %o0
        ldxa    [%i1] DST_ASI_BYTES, %l1 ! x1 = *dst;
        srlx    %o0, %o1, %o0           ! ~srcmask = (~0 >> t1); (invert sense)
        andn    %l0, %o0, %l0           ! x0 &= ~~srcmask; (ie, &= srcmask)
        and     %l1, %o0, %l1           ! x1 &= ~srcmask; (ie, &= ~srcmask)
        or      %l1, %l0, %l1           ! x1 |= x0;
        b       L(done)
        stxa    %l1, [%i1] DST_ASI_BYTES ! *dst = x1;
--
In-Real-Life: Chris Torek, Berkeley Software Design Inc
El Cerrito, CA, USA Domain: to...@bsdi.com +1 510 234 3167
http://claw.bsdi.com/torek/ (not always up) I report spam to abuse@.

Terje Mathisen
Apr 14, 2000
Chris Torek wrote:
>
> Background: Ultrasparc has block load/store operations (as part of
> the VIS instruction set) that load or store 64 bytes into the FPU
> register file. These must be on a 64-byte aligned address, both
> in memory and in the register file. There is a separate mechanism
> for doing byte realignment. Both the load and store avoid allocating
> in the caches (L1 and L2), which of course is important to avoid
> wiping them out when doing large copies.

[_very_ nice & informative discussion snipped]

> Note: I wrote the following code that uses just the integer registers
> to do arbitrarily-aligned memory copies using minimal loads and
> stores. Timing it, vs the VIS instructions, on 64-byte aligned
> copies (and using only two on-the-fly register sets rather than
> three, in the VIS code), shows that the VIS instructions are really
> only very slightly faster, and then only for large blocks, presumably
> due to the cost of saving and restoring the FPU registers (which
> I have been doing with regular stores rather than block-store, for
> my timings). So I assume the real gain from using BLD/BST to do
> block copies, at least for sizes in the 1024 to 65536 byte range or
> so, is that it does not overwhelm the rest of the cache.

Did you use separate code to fill/flush the different cache levels
before timing?

It would be very interesting to see what kind of performance you get as
a function of target and/or source cache level, and size of the
transfer!

Terje

PS. In a somewhat similar vein, it seems I have (re-)invented/discovered
a way to use MMX and cache-bypassing/streaming load operations to do
TCP/IP checksums about 30% faster than the best unrolled integer code,
even for data which is completely resident in L1 cache.

This works on at least two 64-bit blocks simultaneously, using parallel
add, compare and sub to calculate the checksum while doing carry
wraparound.

The interesting points were

a) that MMX has no unsigned compare which I could use to detect the need
for carry wraparound, but by biasing the accumulator with 0x8000, I
could use signed compares to detect integer overflow instead!

and

b) it was possible to write the code in such a way that all reg-reg
moves could be avoided, even with a two-operand instruction set,
resulting in just the (minimum possible?) 4 instructions per 64-bit
block:

        movq    mm0,[esi]   ; Load 64 bits
        paddw   mm1,mm0     ; Add to sum as 4 16-bit short words
        pcmpgtw mm0,mm1     ; Signed compare: sets each word that
                            ;  overflowed to 0xffff (-1)
        psubw   mm2,mm0     ; Add in the carries (overflows) by subtracting -1

To make this work, the accumulator has to alternate between at least one
pair of registers, while storing all the carries in a separate register.

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Chris Torek
Apr 14, 2000
In article <38F6BEEB...@hda.hydro.com>
Terje Mathisen <Terje.M...@hda.hydro.com> writes:
>Did you use separate code to fill/flush the different cache levels
>before timing?

No; and this would probably make a big difference -- when doing
small (under 512 byte) copies, all of the source data will be in
all caches in my test program. I am reluctant to work hard to
flush the cache because (a) I am doing this stuff in user mode
where it is somewhat hard (I can still do displacement flushes of the
L1 cache, but I have no control over, or view of, physical addresses,
which you need to know to get displacement flush to work) and (b)
I only have one Ultrasparc-II CPU for testing on, and I imagine
the results will be different for Ultrasparc I, II, IIi, and III,
and even just for different CPU speeds of the II (because the
CPU to L1 to L2 to memory cycle setups will be different).

I can use the "block commit" mode to wipe away both L1 and L2 cache
entries, though; I might do that. That would give results for
"copies when there is no data in any cache", but not "copies when
the data is in the L2 (E-) cache but not the L1 (D-) cache". This
is not quite what I want, as I expect that the source data will,
in most typical small-to-medium-copy-size cases, be in at least
the L2 cache, if not both caches. (My definition of "medium size"
here includes 8192-byte file system blocks, because one highly-typical
case occurs when a program goes to read or write a file, one block
at a time, using an 8192-byte-block file system.)

This code will wind up in the kernel, and as it happens, in general,
a "small" copy within or into the kernel occurs because the code
is going to use the data immediately:

bcopy(unaligned, aligned, size);
... use aligned[i] ...

or:

struct smallish gloop;

error = copyin(user_addr, &gloop, sizeof gloop);
if (error) return error;
... use gloop.field ...

In this case, having the copied (written) data wind up in the L1
cache is not a bad thing at all. If the bcopy or copyin writes
only to L2 cache and/or main memory, the immediate "use" operations
wind up having to go back to L2 cache or main memory, even though
the bytes were just inside the CPU.

Terje Mathisen
Apr 15, 2000
Chris Torek wrote:
>
> In article <38F6BEEB...@hda.hydro.com>
> Terje Mathisen <Terje.M...@hda.hydro.com> writes:
> >Did you use separate code to fill/flush the different cache levels
> >before timing?
>
> No; and this would probably make a big difference -- when doing
> small (under 512 byte) copies, all of the source data will be in
> all caches in my test program. I am reluctant to work hard to
> flush the cache because (a) I am doing this stuff in user mode
> where it is somewhat hard (I can still do displacement flushes of the
> L1 cache, but I have no control over, or view of, physical addresses,
> which you need to know to get displacement flush to work) and (b)

I'm pretty sure that a sequential scan (with cache-line size stride)
through a large buffer (2x to 4x the largest cache) will flush pretty
much everything.

It is harder to get controlled loading into just L2 and/or L3 than
to move into L1, unless you can use cache-level specific prefetch
operations.

> I only have one Ultrasparc-II CPU for testing on, and I imagine
> the results will be different for Ultrasparc I, II, IIi, and III,
> and even just for different CPU speeds of the II (because the
> CPU to L1 to L2 to memory cycle setups will be different).

There will definitely be differences between cpu generations; for cpu
speeds I like to measure everything both in clock cycles and real time,
since this way it becomes quite obvious which memory hierarchy levels
run at some fixed cpu clock ratio and which run at some other fixed speed.

> I can use the "block commit" mode to wipe away both L1 and L2 cache
> entries, though; I might do that. That would give results for
> "copies when there is no data in any cache", but not "copies when
> the data is in the L2 (E-) cache but not the L1 (D-) cache". This
> is not quite what I want, as I expect that the source data will,
> in most typical small-to-medium-copy-size cases, be in at least
> the L2 cache, if not both caches. (My definition of "medium size"
> here includes 8192-byte file system blocks, because one highly-typical
> case occurs when a program goes to read or write a file, one block
> at a time, using an 8192-byte-block file system.)

[snip]

> This code will wind up in the kernel, and as it happens, in general,
> a "small" copy within or into the kernel occurs because the code
> is going to use the data immediately:
>
> bcopy(unaligned, aligned, size);
> ... use aligned[i] ...

Yeah, in this case it is pretty easy to figure out that loading to L1 is
the best bet; it is more interesting when handing data off to some
external interface, since those blocks will probably not be used again
by the cpu.

Terje

McCalpin
Apr 15, 2000
In article <38F868B8...@hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>Chris Torek wrote:
>>
>> [...] I am reluctant to work hard to

>> flush the cache because (a) I am doing this stuff in user mode
>> where it is somewhat hard (I can still do displacement flushes of the
>> L1 cache, but I have no control over, or view of, physical addresses,
>> which you need to know to get displacement flush to work) and (b)
>
>I'm pretty sure that a sequential scan (with cache-line size stride)
>through a large buffer (2x to 4x the largest cache) will flush pretty
>much everything.

It will certainly flush a lot, but there are typically two
problems:
1. all the lines in the cache may not be mapped
by your "large buffer" because of pseudo-random
page coloring.
2. even for the lines you hit, you are not guaranteed
to replace all the congruence classes, because of
pseudo-random replacement algorithms.

It is common to get 5%-10% cache hits after flushing the cache
using a buffer 4x the cache size.

You should try the experiment sometime -- assume that you have
a 256kB, 4-way set associative L2 cache with 32 Byte cache lines
and 4 kB pages.

There are 2048 lines in the cache, each of which is 4-way set
associative. You can think of the cache as being 4 parallel
caches, each holding 64kB, or 16 pages.

The low-order 5 bits of the address specify the byte in the
cache line, the next 7 bits specify which line this is in the
page, and the next 4 bits specify which "color" the page is.

So as you stream through a 1 MB data array, for example, you
will access 256 pages, each of which will map to one of the 16
colors of the cache.

So generate 256 random numbers between 0 and 15. These are your
page colors. Now sort them and look at the distribution of how
many pages map on to each of the 16 colors of the cache. You need
at least 4 hits on each color (plus perfect LRU replacement) to
ensure that the whole cache has been touched by your large buffer
array.

So what is the probability that at least one of the 16 colors of
the cache is mapped less than 4 times? I don't know the answer
for this one -- anybody stay awake through statistics? My
experience has been that it is easy to get 90% coverage, and very
hard to get over 95% coverage.

Of course if your page colors are pathological, rather than random,
then you may never touch the whole cache -- you may touch only 1/16
of it!
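The experiment is easy to run as a quick Monte Carlo in C. This is a sketch (full_coverage_rate is a name invented here) that models only the page-coloring half of problem 1: it assumes perfect LRU and ignores the pseudo-random replacement from problem 2, which only makes real coverage worse.

```c
#include <stdlib.h>

/*
 * 256 pages colored uniformly at random over 16 colors; a color needs
 * at least 4 pages (one per way of the 4-way cache) before its sets
 * can even potentially be fully displaced.  Returns the fraction of
 * trials in which every color received at least 4 pages.
 */
static double full_coverage_rate(int trials)
{
    int covered = 0;

    for (int t = 0; t < trials; t++) {
        int count[16] = {0};
        int ok = 1;

        for (int p = 0; p < 256; p++)
            count[rand() % 16]++;
        for (int c = 0; c < 16; c++)
            if (count[c] < 4)
                ok = 0;
        covered += ok;
    }
    return (double)covered / trials;
}
```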
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Scientist IBM Power Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."

Chris Torek
Apr 15, 2000
I wrote:
>>>... I am reluctant to work hard to flush the cache because (a) I am
>>>doing this stuff in user mode where it is somewhat hard (I can still
>>>do displacement flushes of the L1 cache, but I have no control over,
>>>or view of, physical addresses, which you need to know to get
>>>displacement flush to work) and (b) ...

>In article <38F868B8...@hda.hydro.com>
>Terje Mathisen <terje.m...@hda.hydro.com> says:
>>I'm pretty sure that a sequential scan (with cache-line size stride)
>>through a large buffer (2x to 4x the largest cache) will flush pretty
>>much everything.

In article <8darkk$1fas$1...@ausnews.austin.ibm.com>
McCalpin <mcca...@gmp246.austin.ibm.com> writes:
>It will certainly flush a lot, but there are typically two
>problems:
> 1. all the lines in the cache may not be mapped
> by your "large buffer" because of pseudo-random
> page coloring.
> 2. even for the lines you hit, you are not guaranteed
> to replace all the congruence classes, because of
> pseudo-random replacement algorithms.
>
>It is common to get 5%-10% cache hits after flushing the cache
>using a buffer 4x the cache size.
>
>You should try the experiment sometime -- assume that you have
>a 256kB, 4-way set associative L2 cache with 32 Byte cache lines
>and 4 kB pages. ...

This is all true, but I was not thinking thoroughly enough in this
specific case, on the Ultrasparc. Here the problem simplifies,
because there is a "flush level 1 and 2 cache" addressing mode (the
"block store with commit" operation), which can be used to push
anything from both caches; and the L1 cache is direct-mapped and
virtually indexed.

(Actually, the latter is a lie: the L1 cache on Ultrasparc I and
II, at least, is physically indexed with a free bit of bizarreness
thrown in to pretend it is virtually indexed. The page size is 8K
and the cache size is 16K; thus, aside from the top bit, virtual
and physical indexing gives exactly the same result, for a
direct-mapped cache. The sneaky bit is: *any* load from memory --
even using the "bypass L1 cache and load directly from physical
address" modifier! -- looks in *both* halves of the cache. If
there is a hit, the cache supplies the data. Otherwise, a regular
virtually-addressed load goes on to look in the L2 cache, and then
allocates a line in the left or right half based on the virtual
address supplied with the load.)

Anyway, since the L1 cache is "virtually" indexed, displacement
flush is easy even from user mode, so one can easily force "present
in L2, absent from L1". Using commit-store, one can easily force
"absent from both". "Present in L1, absent from L2" is an impossible
(error) condition -- the L2 cache manages the sharing protocol for
different entities on the bus.

In any case, this is all for our own OS, so even if it were that
big a problem, I could just give myself a temporary "flush the
cache" system call. :-) But -- the reasons I gave in part (b)
still apply: I am not sure I *want* to consider the "in L2 only"
case for small copies, and for large copies, the effect of the few
possible L1 hits is not very interesting.

The ultimate problem, anyway, is that I have only one system on
which to make measurements, and I want to know what Sun says about
the right way to get performance on *all* VIS-instruction-set
systems. Should we try to have more loads and stores in flight
(using the entire FP register set), or is "one load pending, one
feeding FALIGNDATA instructions" guaranteed to perform just as
well? And: when the blocks are well aligned, e.g., page-aligned
copies, there ought to be some way to just do "load; store", so
that the CPU is not required to be at least 10 times faster than
memory. (Remember the required synchronization:

BLD [src], regs0; src += 64 # these group, so take 1 cycle
BLD [src], regs1; src += 64 # so do these
loop:
fmovd any,any # this is a standalone group
fmovd any,any # as is this
... (8 times total)
# only now is the first BLD guaranteed complete
BST [dst], regs0; dst += 64
BLD [src], regs0; src += 64

...

Since each BLD and BST also takes one instruction group, and eight
FP-group instructions are required in between for synchronization,
we have a minimum of ten pipeline steps for each block-transfer.
If we just happened to get lucky and the source data were all in
cache, it seems to me we could receive all 64 bytes in under 10
CPU clock cycles.) If two loads can progress simultaneously, and
a third BLD guarantees that the first BLD will be complete, then
a three-deep unroll using three of the four FPU register blocks:

BLD [src], regs0; src += 64
BLD [src], regs1; src += 64
BLD [src], regs2; src += 64

loop for n := 3 step 3 to n_blocks do
BST [dst], regs0; dst += 64
BLD [src], regs0; src += 64

BST [dst], regs1; dst += 64
BLD [src], regs1; src += 64

BST [dst], regs2; dst += 64
BLD [src], regs2; src += 64
end loop

would be perfect for the "well aligned" case, and the CPU only
has to be at least twice as fast as memory to guarantee maximum
transfer rates. Or we could go four deep, using all four blocks,
just to avoid the need to divide by three. :-)

David Wragg

Apr 16, 2000
mcca...@gmp246.austin.ibm.com (McCalpin) writes:
> You should try the experiment sometime -- assume that you have
> a 256kB, 4-way set associative L2 cache with 32 Byte cache lines
> and 4 kB pages.
>
> There are 2048 lines in the cache,
^^^^
> [snip]

256 * 1024 / 32 = 8192

I think you meant that there are 2048 lines for each "way". The other
numbers are correct.


David Wragg

Terje Mathisen

Apr 16, 2000
McCalpin wrote:
>
> In article <38F868B8...@hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> >I'm pretty sure that a sequential scan (with cache-line size stride)
> >through a large buffer (2x to 4x the largest cache) will flush pretty
> >much everything.
>
> It will certainly flush a lot, but there are typically two
> problems:
> 1. all the lines in the cache may not be mapped
> by your "large buffer" because of pseudo-random
> page coloring.
> 2. even for the lines you hit, you are not guaranteed
> to replace all the congruence classes, because of
> pseudo-random replacement algorithms.
>
> It is common to get 5%-10% cache hits after flushing the cache
> using a buffer 4x the cache size.

I agree, this was the specific reason I didn't simply claim that 4x
cache size would flush everything.

>
> You should try the experiment sometime -- assume that you have
> a 256kB, 4-way set associative L2 cache with 32 Byte cache lines
> and 4 kB pages.

Been there, done that. :-)

My first memory stream test program would measure MB/s for an (N) KB
load or copy, after doing (M) KB of sequential reads from another big
buffer.

On x86 up to Pentium, it turned out that reading 2x the L1 cache size
was enough to get effectively zero L1 cache hits, i.e. transfer rates
would stay nearly constant independent of the initial flush size.

Other cpus have different cache replacement algorithms, of course.

[explanation snipped]

> So what is the probability that at least one of the 16 colors of
> the cache is mapped less than 4 times? I don't know the answer
> for this one -- anybody stay awake through statistics? My
> experience has been that it is easy to get 90% coverage, and very
> hard to get over 95% coverage.

The important part here is that the results you get from doing
measurements this way are (modulo the data being freshly delivered by
some busmaster io device) quite representative of what you'll experience
as close to worst case performance in real life.

When testing different approaches to prefetching during some processing
task, the difference between 90% and 100% L2 cache miss doesn't matter
much either, since the knee in the performance plot corresponding to
sufficient/optimal prefetch size & distance will be very noticeable.

>
> Of course if your page colors are pathological, rather than random,
> then you may never touch the whole cache -- you may touch only 1/16
> of it!

This is much less probable when doing the testing in "real mode", where
phys == virt for all addresses. :-)

Maynard Handley

Apr 19, 2000
In article <8darkk$1fas$1...@ausnews.austin.ibm.com>,
mcca...@gmp246.austin.ibm.com (McCalpin) wrote:

>It will certainly flush a lot, but there are typically two
>problems:
> 1. all the lines in the cache may not be mapped
> by your "large buffer" because of pseudo-random
> page coloring.
> 2. even for the lines you hit, you are not guaranteed
> to replace all the congruence classes, because of
> pseudo-random replacement algorithms.
>
>It is common to get 5%-10% cache hits after flushing the cache
>using a buffer 4x the cache size.
>

>You should try the experiment sometime -- assume that you have
>a 256kB, 4-way set associative L2 cache with 32 Byte cache lines
>and 4 kB pages.
>

>There are 2048 lines in the cache, each of which is 4-way set
>associative. You can think of the cache as being 4 parallel
>caches, each holding 64kB, or 16 pages.
>
>The low-order 5 bits of the address specify the byte in the
>cache line, the next 7 bits specify which line this is in the
>page, and the next 4 bits specify which "color" the page is.
>
>So as you stream through a 1 MB data array, for example, you
>will access 256 pages, each of which will map to one of the 16
>colors of the cache.
>
>So generate 256 random numbers between 0 and 15. These are your
>page colors. Now sort them and look at the distribution of how
>many pages map on to each of the 16 colors of the cache. You need
>at least 4 hits on each color (plus perfect LRU replacement) to
>ensure that the whole cache has been touched by your large buffer
>array.
>

>So what is the probability that at least one of the 16 colors of
>the cache is mapped less than 4 times? I don't know the answer
>for this one -- anybody stay awake through statistics? My
>experience has been that it is easy to get 90% coverage, and very
>hard to get over 95% coverage.

This bugged me and bugged me till I came up with the answer. My route took
a long detour through apparently irrelevant material, but the final result
is simple and easily explained.

The probability that any particular set will have all four ways touched
when wiped 16 times is 1 - 4*(3/4)^16 = 96.0%.

I argue that of all the possible combinations that will leave say way 0
unwiped: the first of the 16 touches have a probability of 3/4 of not
touching way 0, then again for the next touch and so on, for a total
probability of (3/4)^16 of not touching way 0. Likewise for ways 1, 2, 3
giving a total probability of not touching a way of 4*(3/4)^16. I think
this is correct and it seems pretty obvious and agrees with my much more
complicated construction.

Maynard

Andrei A. Dergatchev

Apr 23, 2000
[...]

>Note: I wrote the following code that uses just the integer registers
>to do arbitrarily-aligned memory copies using minimal loads and
>stores. Timing it, vs the VIS instructions, on 64-byte aligned
>copies (and using only two on-the-fly register sets rather than
>three, in the VIS code), shows that the VIS instructions are really
>only very slightly faster, and then only for large blocks, presumably

*** ^^^^^^^^^^^^^^^ ***


>due to the cost of saving and restoring the FPU registers (which
>I have been doing with regular stores rather than block-store, for
>my timings). So I assume the real gains from using BLD/BST to do
>block copies, at least for sizes in the 1024 to 65536 byte range or
>so, is that it does not overwhelm the rest of the cache.

[...]

Sorry for replying late on this thread.
I don't have any code which uses BLD yet, so I took an example
from Sun VIS development kit.
To compare performance with and without BLD, in one case I
use it:
ldda [sa]ASI_BLK_P,A0; \
cmp sa,se; \
blu,pt %icc,1f; \
inc 64,sa; \
dec 64,sa; \
and in another I don't:
ldd [sa + 0],A0; \
ldd [sa + 8],A1; \
ldd [sa + 16],A2; \
ldd [sa + 24],A3; \
ldd [sa + 32],A4; \
ldd [sa + 40],A5; \
ldd [sa + 48],A6; \
ldd [sa + 56],A7; \
cmp sa,se; \
blu,pt %icc,1f; \
inc 64,sa; \
dec 64,sa; \

I tested it on an E450 with 400MHz CPUs and Sun Workshop 5.
For my quick test, a version with BLD is about 5 times *slower*
than without it (i.e., with 8 ldd's). Compilation for v9a instead
of v8plusa makes only small difference.
For now I can't think of any satisfactory explanation for what
I'm observing though.
It is indeed interesting to check if unrolling in 3 BLD's cycles
will change anything significantly - I want to test it too.

Regards,

Andrei

The code is the following (to flush caches I use perfmon package;
vdk_vis_copy_8_blk & vdk_vis_copy_8_blk1 are linked from .S files
found in VIS development kit):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include "perfmon.h"

/* copy routines linked from the VIS development kit .S files */
extern void vdk_vis_copy_8_blk(unsigned char *, unsigned char *, int);
extern void vdk_vis_copy_8_blk1(unsigned char *, unsigned char *, int);

void vis_test (int fd, unsigned char *src1, unsigned char *dst1,
	       unsigned char *src2, unsigned char *dst2,
	       int xsize, int ysize, int zsize)
{
	int ntime = 1000, j;
	hrtime_t start;

	printf(" Size = %d \n", xsize*ysize*zsize);
	/* Flush all caches */
	ioctl(fd, PERFMON_FLUSH_CACHE);
	start = gethrtime();
	for (j = 0; j < ntime; j++)
		vdk_vis_copy_8_blk(src1, dst1, xsize*ysize*zsize);
	printf(" VIS block-GL Time = %lld msec\n",
	    (gethrtime() - start)/1000000);
	/* Flush all caches */
	ioctl(fd, PERFMON_FLUSH_CACHE);
	start = gethrtime();
	for (j = 0; j < ntime; j++)
		vdk_vis_copy_8_blk1(src2, dst2, xsize*ysize*zsize);
	printf(" VIS-GL Time = %lld msec\n",
	    (gethrtime() - start)/1000000);
	printf("\n");
}

int main (void)
{
	unsigned char *src, *dst, *sptr, *dptr;
	int xsize, ysize, zsize, fd, j;

	fd = open("/dev/perfmon", O_RDONLY);
	if (fd == -1) {
		perror("open(/dev/perfmon)");
		exit(1);
	}

	xsize = 512;
	ysize = 512;
	zsize = 1;
	/* Over-allocate by 63 bytes so we can round *up* to a 64-byte
	   boundary; rounding the raw malloc pointer down with "& ~0x3f"
	   can step outside the allocated block. */
	sptr = (unsigned char *) malloc(xsize*ysize*zsize + 63);
	if (!sptr) {
		fprintf(stderr, "Couldn't malloc space for src.\n");
		exit(1);
	}
	src = (unsigned char *) (((unsigned long) sptr + 63) & ~0x3fUL);
	memset(src, 0, xsize*ysize*zsize);

	dptr = (unsigned char *) malloc(xsize*ysize*zsize + 63);
	if (!dptr) {
		fprintf(stderr, "Couldn't malloc space for dst.\n");
		exit(1);
	}
	dst = (unsigned char *) (((unsigned long) dptr + 63) & ~0x3fUL);

	vis_test(fd, src, dst, src, dst, xsize, ysize, zsize);
	/* touch dst so the copies can't be optimized away */
	for (j = 0; j < xsize*ysize*zsize - 4; j++)
		dst[j] = dst[j+1];

	/* free the original malloc pointers, not the aligned ones */
	free(sptr);
	free(dptr);

	return EXIT_SUCCESS;
}

