Now that the first 21264-based systems have shipped, I'm allowed to
talk about this stuff.
Compiler/Assembly Language programming hints specific to the 21264
(also known as 'ev6').
The chip manual has not yet been externally released (no, I don't
know what part number it will appear under), but I expect it to be
released relatively soon.
These hints fall into roughly two areas, which aren't well documented
in the existing revs of the (limited distribution) manuals. Richard
Henderson (r...@cygnus.com) has a copy of the 21264 spec, and has added
support to egcs/gcc for those features which are well described.
I don't personally pretend to fully understand the microarchitecture
of the 21264, so I may not be able to fully answer questions about
it (even though I have a manual).
The 21264 is an out-of-order cpu, with 32 integer and 32 floating
point architectural registers. The 21264 implementation has
more internal (to the cpu) registers which permits the cpu
itself to perform register renaming. Additionally, it has a
trainable, multi-level branch prediction mechanism.
The implication of being an out-of-order implementation is that
the results of some instructions as seen in the program may be
computed well before the results of physically adjacent
instructions. From a programming perspective, this makes the
CPU somewhat non-deterministic.
Quadpacks, Branch prediction, and [F]CMOVxx:
The instruction issuer in the 21264 grabs four instructions at
a time, thus the 'quadpack' name. The branch predictor can only
learn/predict from one control flow instruction per quadpack.
The issue queue will stall if there is a mis-predicted control-flow
change in a quadpack. This means that for maximal performance:
- Only put one control-flow instruction per quadpack
- Control-flow instructions need to be the last ones in
  the quadpack
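As a sketch of both rules (hypothetical loop body; the register
choices and the 'loop' label are arbitrary), an aligned quadpack
might look like:

	addq	$1, $2, $3	# three non-control instructions first
	subq	$4, 1, $4	# decrement the loop counter
	s4addq	$5, $6, $7	# more independent work
	bne	$4, loop	# the only control-flow insn, in the last slot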
CMOVxx/FCMOVxx - Internally to the chip, this is handled as two
separate operations. In the case of highly consistent or
predictable data, a sequence written to use branches instead
of CMOVxx could be faster. For unpredictable/inconsistent data,
the CMOVxx sequence will be faster.
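For example, an absolute-value computation can be written either
way (illustrative only; $16 is the input, $0 the result):

	# Branch form - faster when the sign of $16 is highly predictable:
	bis	$16, $16, $0	# $0 = $16
	bge	$16, 1f		# skip the negate for non-negative input
	subq	$31, $16, $0	# $0 = -$16
1:
	# CMOV form - faster when the sign is unpredictable:
	bis	$16, $16, $0	# $0 = $16
	subq	$31, $16, $1	# $1 = -$16
	cmovlt	$16, $1, $0	# if $16 < 0, $0 = $1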
Replay traps:
There are a number of these. If the CPU encounters a situation
which requires a replay trap (i.e. out-of-order state must be
unwound and performed in-order), there is a significant penalty
(13 cycles) incurred. Additionally, (I believe) all operations
which were in-flight will re-issue once the replay trap has
completed. The following replay traps exist:
Load-Load: I don't understand the circumstances under which this
will occur particularly well, so I'm not going to attempt
to describe how it can happen.
Load-Store: This is the situation in which one member of a
load/store pair (to the same memory location) executes
before the other member of the pair does. For example,
ldq $0, 0($16)
stq $1, 0($17)
If the STQ executes prior to the execution of the LDQ,
and both the LDQ and STQ are pointing at the same address,
this replay trap will occur, and the CPU is required to
unwind its internal state to generate correct results.
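When the compiler (or programmer) can see that such a pair may
alias, scheduling independent work between the two makes it less
likely that the store issues first (a sketch; the intervening
instructions are placeholders for whatever useful work is available):

	ldq	$0, 0($16)	# issue the load as early as possible
	addq	$2, $3, $4	# ...independent work in between...
	subq	$5, 1, $5
	stq	$1, 0($17)	# store is now unlikely to execute first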
Size: For purposes of memory access and this particular trap,
pretend the 21264 cpu only operates on 4 and 8-byte
quantities. If there are two (or more) operations which
occur on the same longword or quadword in memory, but of
differing size, this trap will happen. The most frequent
occurrence of this is probably going to be in the context
of code which is performing IEEE aware activities.
That is, code which might perform:
stt $f16, 0(sp)
ldl $0, 0(sp)
srl $0, N, $1
to look at the exponent or some specific bits of the mantissa
of a floating point number, and performing scaling to do a
table lookup. This occurs in various trigonometric functions.
The other place where this may occur is in a field-merging
optimization situation, but I don't think gcc/egcs for alpha
are capable of doing this.
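One way to avoid the trap in the exponent-extraction case is to
reload the value with an access of the same (8-byte) size and then
shift, e.g. (a sketch; stt is the floating-point quadword store,
and the shift counts assume IEEE double format):

	stt	$f16, 0(sp)	# 8-byte store of the double
	ldq	$0, 0(sp)	# 8-byte reload - sizes match, no Size trap
	sll	$0, 1, $0	# discard the sign bit
	srl	$0, 53, $1	# $1 = the 11-bit IEEE exponent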
Richard Gorton http://www.digital.com/amt
Compaq Computer Corporation All standard disclaimers apply.
"A committee is a cul-de-sac down which ideas are lured and then quietly
--Sir Barnett Cocks
I believe that this is related to the definition of one of the
"memory litmus" tests, which states that loads from the same address must
issue in-order. If you look in the Alpha architecture manual, there's
a section on litmus tests for memory-like behavior. One litmus test
concerns ld-ld dependences, and this trap honors that. I've never
figured out why this litmus test exists.
> Load-Store: This is the situation in which one member of a
> load/store pair (to the same memory location) executes
> before the other member of the pair does. For example,
> ldq $0, 0($16)
> stq $1, 0($17)
> If the STQ executes prior to the execution of the LDQ,
> and both the LDQ and STQ are pointing at the same address,
> this replay trap will occur, and the CPU is required to
> unwind its internal state to generate correct results.
FWIW, the '264 has a prediction mechanism to stall the store when it's
likely to conflict with a prior load (based on prior behavior). For
the SPEC benchmark suite and a non-Digital model of a more aggressive
8-order machine, the '264 mechanism is accurate ~90% of the time. The
precise mechanism is described in a paper that appeared at ICCD'98.
Dirk Grunwald Assoc. Prof, Univ. of Colorado at Boulder,
Currently living at DEC-WRL and having a darn good time of it.