
Cache instructions to enhance memory write performance


Mike O'Connor
Oct 10, 1997

Graeme Gill wrote:
>
> Can anyone list the processors that implement cache instructions
> designed to improve memory write performance, similar to the
> dcbz (Data Cache Block Set to Zero) instruction in the PowerPC
> architecture?
>


The picoJava cores include a "zero_line" instruction
basically similar to the PPC dcbz instruction.

- Mike

+-------------------------------+--------------------------+
| Mike O'Connor                 | mike.o...@eng.sun.com    |
| Java Microprocessor Architect |                          |
| Sun Microelectronics, Inc.,   | (office) 408.774.8161    |
| a Sun Microsystems business   | (fax) 408.774.8154       |
+-------------------------------+--------------------------+

Graeme Gill
Oct 10, 1997

Can anyone list the processors that implement cache instructions
designed to improve memory write performance, similar to the
dcbz (Data Cache Block Set to Zero) instruction in the PowerPC
architecture?

Such support is necessary for fast graphics on systems with an
allocate-on-write-miss type data cache. Without this support,
such systems suffer from a 2:1 write performance penalty, due to
unnecessary read traffic (prime example: the Intel Pentium Pro/II).
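
To make the 2:1 penalty concrete, here is a minimal C sketch (all
names are illustrative, not from any particular library; the fast
path assumes GCC-style inline asm on PowerPC and a 32-byte line):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE 32  /* assumed cache line size; machine-specific */

    void zero_lines(void *buf, size_t bytes)
    {
        char *p = buf;
        char *end = p + bytes;

        /* dcbz zeroes the entire enclosing line, so first advance
           to a line boundary with ordinary stores */
        while (((uintptr_t)p & (LINE - 1)) && p < end)
            *p++ = 0;
    #if defined(__powerpc__)
        /* establish each full line as zeros in the cache, with no
           line-fill read from memory */
        for (; p + LINE <= end; p += LINE)
            __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
    #endif
        /* tail (and non-PPC fallback): plain stores; on an
           allocate-on-write-miss cache, each missing line is read
           from memory before being overwritten -- the 2:1 penalty */
        memset(p, 0, (size_t)(end - p));
    }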

Graeme Gill.

Andy 'Krazy' Glew
Oct 15, 1997, to Graeme Gill

On the Intel Pentium Pro you can avoid the unnecessary bus traffic due
to write allocate via read-for-ownership writeback caching in two ways:

(1) if the data is write-mainly, you can map it as memory type WC (Write
Combining). WC memory combines individual writes into cache-line-sized
bursts, but does not place the data in the data cache proper.
Unfortunately, WC reads are treated as uncached reads, and are hence slow.

(2) if the data is ordinary cacheable data, you can allocate it in the
data cache (L1 and L2) by performing a REP STOS or a REP MOVS.

If "Fast Strings" is enabled (typically a BIOS option on by default)
then the microcode for sufficiently large REP {STOS,MOVS}[BWD]
a) performs 64 bit accesses, grouped a cache line or so at a time
b) does not perform a read-for-ownership of the cache line on a
write miss, but instead buffers the entire line up, and does an
INValidate address-only bus transaction if the entire line is written.

I.e. a REP STOSD of zeroes for a large buffer is equivalent to several
DCBZ instructions for the same buffer.

The gotcha is that the buffer to be allocated must be somewhat larger
than a cache line for the fast microcode to kick in. But if you are
doing graphics or processing of large images, that is not a problem.
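
For example, here is a minimal C sketch of triggering the fast-strings
path (GCC-style inline asm; zero_buffer is an illustrative name, and
the buffer is assumed to be a multiple of 4 bytes and comfortably
larger than a cache line):

    #include <stddef.h>

    /* zero a large buffer via REP STOS; with "Fast Strings" enabled,
       the P6 microcode can skip the read-for-ownership and issue
       invalidate-only bus transactions for fully-written lines */
    static void zero_buffer(void *buf, size_t bytes)
    {
        size_t dwords = bytes / 4;  /* assumes bytes % 4 == 0 */

        __asm__ volatile ("cld\n\trep stosl"
                          : "+D" (buf), "+c" (dwords)
                          : "a" (0)
                          : "memory", "cc");
    }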

By the way, the next time I do something like this my tendency would be
to use more hardware and less software (aka microcode), since with a bit
more hardware the test to determine whether to use the fast or slow
strings would have been faster. I was stupidly in the RISC mentality of
"do it in software" (meaning microcode), and did not take advantage of
all the power of VLSI hardware.

Andy 'Krazy' Glew
Oct 21, 1997, to Emil Naepflein


Emil Naepflein wrote:

> > Can anyone list the processors that implement cache instructions
> > designed to improve memory write performance, similar to the
> > dcbz (Data Cache Block Set to Zero) instruction in the PowerPC
> > architecture?
>

> Only the MIPS R4K/R10K does not have this functionality.

I don't know about the R10K,
but I am fairly sure that the R2K
had coprocessor cache control instructions.

In particular, I think I remember that it had
"Change address tag on cache line"
which might be useful for fast copies.

My personal point of view is that all of these optimizations
need to be wrapped into a CISCy block copy operation,
or, even more ideally, into memory streams a la Wulf (hi Sally!)
so that arbitrary computation can be done on the stream as it passes.

I.e.:
- integrate "change address tag on cache line" if aligned properly
  modulo cache set resonance;
- use line-oriented copy out/in from the cache array if cache-line
  aligned and hitting;
- use optimized non-allocating protocols when bus traffic is necessary;
- use bus protocols that permit copying directly from one memory chip
  to another; and
- use any special features your memory gives you inside the chip, such
  as copying a row into the sense-amp latches and then copying it back
  to another address.

All of these are easy, and have been done (or at least thought of,
and described in this newsgroup, by yours truly).

The hard part is wrapping it all together.
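
As a rough sketch of the dispatch problem only (not of the mechanisms
themselves), here is a C skeleton; every branch falls back to memcpy,
since the interesting machinery lives below the ISA, and all names and
sizes here are assumptions:

    #include <stdint.h>
    #include <string.h>

    #define LINE     128          /* assumed cache line size */
    #define SET_SPAN (64 * 1024)  /* assumed cache size / associativity */

    void unified_copy(void *dst, const void *src, size_t n)
    {
        uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;

        if (n >= LINE && d % LINE == 0 && s % LINE == 0) {
            if (d % SET_SPAN == s % SET_SPAN) {
                /* same cache-set alignment: a "change address tag"
                   rename could retarget resident lines in place */
                memcpy(dst, src, n);
            } else {
                /* line-aligned: line-oriented copy through the cache
                   array, or a non-allocating protocol on misses */
                memcpy(dst, src, n);
            }
        } else {
            /* unaligned or small: ordinary load/store copy */
            memcpy(dst, src, n);
        }
    }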

John R. Mashey
Oct 23, 1997

In article <344D170C...@cs.wisc.edu>, Andy 'Krazy' Glew <gl...@cs.wisc.edu> writes:

|> I don't know about the R10K,
|> but I am fairly sure that the R2K
|> had coprocessor cache control instructions.

I think you mean R4K and up, which have lots of them.

All I could get into the R2K along those lines was fairly weird:
a bit that let you swap I & D caches, used for flushing I-caches
slightly more efficiently, and only because of the bus-multiplexing.
I.e., rather than having to flush a page of the cache by jumping to
a properly-aligned page of NOPS, you'd swap the caches, then
execute a loop of store-byte instructions that would flush the words,
then swap I & D back...

This worked, just barely.
The OS folks who did the IRIX for Power Series SMP weren't enthralled with it.


--
-john mashey DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: ma...@sgi.com DDD: 650-933-3090 FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

John McCalpin
Nov 4, 1997

In article <3469cbe3....@philos.philosys.de>,
Emil Naepflein <Emil.Na...@philosys.de> wrote:
>
>A question to John Mashey:

This is my territory -- I'm the bandwidth bigot!


>What is the data copy throughput (bcopy) on a 64 cpu Origin between
>the remotest locations and in local memory for data blocks
>significantly larger than the secondary cache?

On the Origin2000, my Fortran bcopy() with one local array and
one remote array runs like:
    local        148 MB/s  (same as STREAM Copy of 296 MB/s)
    2 routers    - 9% worst-case on   8-cpu machine
    3 routers    -12% worst-case on  16-cpu machine
    4 routers    -15% worst-case on  32-cpu machine
    5 routers    -18% worst-case on  64-cpu machine
    6 routers    -21% worst-case on 128-cpu machine

The last two numbers are estimates -- I have measured remote reads
on 64- and 128-cpu machines, but I did not do this particular bcopy
test on a machine bigger than 32 cpus. (The estimates follow the
roughly 3%-per-router trend of the measured numbers.)

It is interesting to note that the performance is the same for local
source/remote destination and for remote source/local destination.
--
John D. McCalpin, Ph.D. Supercomputing Performance Analyst
Technical Computing Group http://reality.sgi.com/mccalpin/
Silicon Graphics, Inc. mcca...@sgi.com 650-933-7407

John McCalpin
Nov 9, 1997

In article <347370b3...@philos.philosys.de>,
Emil Naepflein <Emil.Na...@philosys.de> wrote:

>mcca...@frakir.engr.sgi.com (John McCalpin) wrote:
>
>> On the Origin2000, my Fortran bcopy() with one local array and
>> one remote array runs like:
>>    local        148 MB/s  (same as STREAM Copy of 296 MB/s)
>>    2 routers    - 9% worst-case on   8-cpu machine
>>    3 routers    -12% worst-case on  16-cpu machine
>>    4 routers    -15% worst-case on  32-cpu machine
>>    5 routers    -18% worst-case on  64-cpu machine
>>    6 routers    -21% worst-case on 128-cpu machine
>>
>> The last two numbers are estimates [...]
>
>How large is the cache-line size?
>I assume 128 bytes.

128 bytes is correct.

>A machine using a 64-byte cache-line size has ONLY
>half of that, about 74 MB/s.

Close enough. This is a major reason why we have 128 byte cache
lines. As processors get faster, there will be more and more
incentive for everyone else to go to 128 byte lines as well.


>The scaling to more processors seems pretty good, but the absolute
>value seems really disappointing in the light of available connection
>and memory bandwidth.

I can certainly understand wanting more (and more and more) bandwidth,
but this should not be too disappointing --- it is the best in the
industry for a shared-memory system.

For comparisons, look at the STREAM Copy results and divide by two
to get bcopy numbers (STREAM Copy counts both the read and the write
traffic, while bcopy rates count only the bytes copied -- hence the
296 MB/s STREAM Copy above corresponds to the 148 MB/s local bcopy).

http://www.cs.virginia.edu/stream/


>As far as I can remember the memory bandwidth is
>about 800 MB/s. The problem seems to be the latency for the memory
>accesses, the necessary three bus transactions for moving the data and
>the limited number of active memory transactions of the R10K
>processor.

The details are complicated, but these are some of the important
issues.
