Newsgroups: comp.arch
From: "Andy (Super) Glew" <a...@SPAM.comp-arch.net>
Date: Thu, 12 Jan 2012 23:39:49 -0800
Local: Fri, Jan 13 2012 2:39 am
Subject: Re: AMD Bulldozer optimization guide
On 1/11/2012 10:15 AM, Tim McCaffrey wrote:

> http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf

> Talks a little about internals of the processor.

Some notes, and some thoughts:

p. 21 - a single macro-op can handle load and store to the same address,
whereas a micro-op can only be a load or a store.

This isn't new to Bulldozer - e.g. the Family 10h and 12h processors had
it - but I still find it interesting.  Heck, I think that it would be
entirely reasonable to perform a single address calculation and TLB
lookup for both the load part and the store part of an instruction like
INC mem (unlocked, to avoid other issues). Saves power, reduces uops, and ...

Many folks discuss whether or not stores should be split into separate
address and data operations within the processor.  P6 did this in
microcode, but more recent Intel processors have fused them, firing store
address and store data separately out of a single scheduler entry.  Mitch has
discussed his design (and I think also CDC? or Burroughs?) where you
sent an address to an address register, and then read or wrote the
corresponding data register to perform a memory reference
- which to some extent exposes the separation of address and data parts
of both loads and stores.

With separate loads and stores, an x86 instruction like INC mem might
look like:
     INC mem
         tmp := load( M[seg:+basereg+indexreg+offset] )
         tmp := tmp + 1
         store( M[seg:+basereg+indexreg+offset] := tmp )

With split store_address and store_data:
     INC mem
         tmp := load( M[seg:+basereg+indexreg+offset] )
         tmp := tmp + 1
         store_address( M[seg:+basereg+indexreg+offset] )
         store_data( tmp )

which you can really think of as:

  INC mem
    tmp := load( M[seg:+basereg+indexreg+offset] )
    tmp := tmp + 1
    store_buffer_entry.addr := &M[seg:+basereg+indexreg+offset]
    store_buffer_entry.data := tmp
        // and, much later, after retirement, not an OOO uop:
        // M[store_buffer_entry.addr] := store_buffer_entry.data

Sharing the address computation for loads and stores:
  INC mem
    lsq_entry.addr := &M[seg:+basereg+indexreg+offset]
    tmp := load( M[lsq_entry.addr] )
    tmp := tmp + 1
    lsq_entry.store_data := tmp
        // and, much later, after retirement, not an OOO uop:
        // M[lsq_entry.addr] := lsq_entry.store_data

Optimizing:
  INC mem
    tmp := load( M[lsq_entry.addr := &M[seg:+basereg+indexreg+offset]] )
    lsq_entry.store_data := tmp+1
        // and, much later, after retirement, not an OOO uop:
        // M[lsq_entry.addr] := lsq_entry.store_data
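
For anyone who prefers to poke at this, here is a minimal Python sketch of the shared-address variant. It is a toy: memory is a dict, and LsqEntry, inc_mem and commit are hypothetical names made up for illustration, not anything from the AMD guide or a real design.

```python
from dataclasses import dataclass


@dataclass
class LsqEntry:
    """Hypothetical load/store queue entry shared by the load and store parts."""
    addr: int = 0
    store_data: int = 0


def inc_mem(memory, segbase, basereg, indexreg, offset):
    """Execute phase: one address calculation serves both load and store."""
    entry = LsqEntry()
    entry.addr = segbase + basereg + indexreg + offset  # computed once
    tmp = memory.get(entry.addr, 0)                     # load part
    entry.store_data = tmp + 1                          # data for the store part
    return entry


def commit(memory, entry):
    """Much later, after retirement: the store finally updates memory."""
    memory[entry.addr] = entry.store_data


memory = {0x1010: 41}
entry = inc_mem(memory, 0x1000, 0x10, 0, 0)
value_before_commit = memory[0x1010]   # still the old value
commit(memory, entry)
value_after_commit = memory[0x1010]
```

The point of the split into inc_mem and commit is only to show that the memory update is not part of the out-of-order work.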

p. 24 - the usual melange of partial registers and merging

p. 25 - load-execute instructions for unaligned data

32B instruction fetch
p. 26 - align hot loops to 32 bytes instead of 16
         (32B instruction fetch)
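
A trivial sketch of the padding arithmetic, for a tool that wants to place a loop head on a 32-byte boundary. padding_to_align is a hypothetical helper; in practice an assembler alignment directive does this for you.

```python
def padding_to_align(addr: int, alignment: int = 32) -> int:
    """Bytes of padding needed so that addr lands on an alignment boundary."""
    return (-addr) % alignment


# A loop head at 0x401013 needs 13 bytes of padding to reach 0x401020.
pad = padding_to_align(0x401013)
```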

p. 30 - superforwarding

p. 33
D$ - 4 ways, way prediction
      16 banks, 16 bytes wide (i.e. intra cache line banks + inter)
      4 cycles load-to-use

L2$ load to use latency 20+ cycles
       mostly inclusive

L3$ - non-inclusive victim cache
     - not strictly exclusive, if only because of other cores

p. 34
Branch Prediction
    Predicted taken bubble
        - 1 cycle predicted by L1 BTB
        - 4 cycles predicted by L2 BTB
    Minimum branch misprediction penalty
        - 20 cycles conditional and indirect
        - 15 cycles unconditional direct and returns

(GLEW COMMENT: I'm a little puzzled by this.  I think that it means
that the unconditional direct branches and returns are recognized by the
decoder, even if they miss the BTB.  Which is standard.  But it also seems to
imply that they have not optimized, much, the path reporting these,
what P6 called "shortstop branch mispredictions".
     Also, unconditional direct branches can always be handled at the
decoder, but returns USUALLY can, but not always.  You can recognize a
return that missed the BTB, and use your return stack predictor - but it
is always possible that your return stack predictor may be wrong. So I
think this 15 cycle penalty applies when the return misses both BTBs,
but hits the return stack predictor.)

"... accessed using the fetch address of the current window. Each BTB
entry includes information about a branch and its target."
- since there may be more than one branch in a fetch block, I think they
are using the "ways" to handle that possibility.  Which is a slight
terminology stretch, but one that I have already encountered.

"Most of the time, as calls are fetched, the next return address is
pushed onto the return stack and subsequent returns pop a predicted
return address off the top of the stack. However, mispredictions
sometimes arise during speculative execution. Mechanisms exist to
restore the stack to a consistent state after these mispredictions."
- i.e. the stack isn't purely a stack.
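
The guide does not say what the repair mechanism is. One common textbook approach is to checkpoint the top-of-stack pointer plus the top entry at each predicted branch and put them back on a misprediction. A minimal sketch of that approach, purely as an assumption about what such a mechanism might look like (ReturnStack and its methods are hypothetical names):

```python
class ReturnStack:
    """Toy return-address predictor with checkpoint/restore."""

    def __init__(self, size=8):
        self.entries = [0] * size
        self.top = 0  # index of the next free slot; wraps around

    def push(self, return_addr):
        """On a fetched CALL: remember where the matching RET should go."""
        self.entries[self.top % len(self.entries)] = return_addr
        self.top += 1

    def pop(self):
        """On a fetched RET: return the predicted target."""
        self.top -= 1
        return self.entries[self.top % len(self.entries)]

    def checkpoint(self):
        """Save the pointer and the top entry, which wrong-path calls may overwrite."""
        return self.top, self.entries[(self.top - 1) % len(self.entries)]

    def restore(self, saved):
        """Undo wrong-path pushes and pops after a misprediction."""
        self.top, top_value = saved
        self.entries[(self.top - 1) % len(self.entries)] = top_value


rs = ReturnStack()
rs.push(0x1000)           # correct-path call
saved = rs.checkpoint()   # taken at a predicted branch
rs.pop()                  # wrong-path RET ...
rs.push(0xBAD)            # ... then wrong-path CALL clobbers the top entry
rs.restore(saved)         # the branch turned out to be mispredicted
predicted = rs.pop()
```

Saving only the pointer would not be enough here, since the wrong-path CALL overwrote the entry the correct path still needs.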

Q: does Intel do this yet?

p. 35 TLBs

L1-DTLB - 32 entries, fully associative, multiple page sizes
L2-DTLB - 1024 with 4K, 2M, 1G, 8 way set associative
         - Q: how does this work?

p. 38 LSU

2 128 bit loads/cycle (to different L1 banks)
+ 1 128 bit store
Q: how does this reconcile with
"The LS unit is composed of two largely independent pipelines
enabling the execution of two memory operations per cycle"
-- I think it may mean that store commit is special. (Or ???)

24 entry store queue.
40 entry load buffer

p. 39

4 write combining buffers
*and* Write Coalescing Cache

- I suspect the write-through L1$ was causing pains.

Prefetchers

One prefetcher from DRAM ... into a prefetch buffer, NOT the L1, L2, or
L3. (Again, not new with Bulldozer, but still interesting.)  They don't
say how big a prefetch buffer.

And another prefetcher into the L1 or L2.

At page 104, they talk about 3 prefetchers: L1, L2 region prefetcher,
and the DRAM prefetcher.

p. 40 - 2 or 4 DRAM channels.

p. 41 - Hypertransport assist, aka probe filtering (not new).

Steals part of the L3 to use as a directory.

p. 52 store-to-load forwarding prediction
"The AMD Family 15h processor contains hardware to accelerate such
store-to-load dependencies, allowing the load to obtain the store data
before it has been written to memory."

They don't describe the predictor.

The University of Wisconsin sued Intel over a patent on such a predictor.
  My advisor, Guri Sohi, and several of my friends (Andreas Moshovos,
etc.) were inventors on that patent. Intel settled.
I'm sure AMD did, too.

http://dotnet.sys-con.com/node/499104

p. 80 - load-execute instructions are preferred over separate load and
execute instructions.  (Not new.)

p. 81 - unaligned large (SIMD, 128 bit and 64) accesses are recommended.
Better misaligned support.

p. 82 - "Take Advantage of x86 and AMD64 Complex Addressing Modes"

segbase + basereg + indexreg<<scale + offset not penalized.

(x86s thrash about this.  P6 did, Banias didn't, etc.  Bulldozer does
it, like P6.)  (Well, except for segbase?)
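
For reference, the full address arithmetic written out as a sketch. effective_address is a hypothetical helper, and the wrap to 64 bits is my simplification (real x86 wraps according to the address size in effect):

```python
def effective_address(segbase, basereg, indexreg, scale, offset, bits=64):
    """segbase + basereg + (indexreg << scale) + offset, wrapped to `bits` bits."""
    if scale not in (0, 1, 2, 3):
        raise ValueError("x86 scale is a shift of 0..3 (x1, x2, x4, x8)")
    total = segbase + basereg + (indexreg << scale) + offset
    return total & ((1 << bits) - 1)


# Element 4 of an array of 8-byte items at 0x1000, plus a field offset of 8.
ea = effective_address(0, 0x1000, 4, 3, 8)
```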

p. 85 - partial register writes

Seems to use a merge on write scheme, not merge on read.

Optimization for initializing instructions, and writing upper bits of XMM
registers. 2 zero-bit flags dataflow through the processor.

"Another optimization recognizes MOVLPD/MOVHPD pairs and internally
converts the MOVLPD to a MOVSD xmm, mem."

p. 89 LEAVE instruction is recommended.

p. 90 SHLD deprecated - VectorPath.

Boo hoo, for people like me who like BitBlt.

p. 92, NOPs of various sizes.

"Note that NOP instructions which contain more than three prefix bytes
degrade performance"

p. 98 "A misaligned store or load operation suffers a minimum one-cycle
penalty in the processor’s load-store pipeline. Also, using misaligned
loads and stores increase the likelihood of encountering a store-to-load
forwarding pitfall, especially when operating in long mode"

"Store forwarding only occurs when the load
virtual address exactly matches the store virtual address and the store
size is greater than or equal to
the load size."
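
The quoted rule as a predicate, in case it helps when auditing code for forwarding pitfalls. can_forward is a hypothetical name, and this captures only the two conditions stated in the quote, nothing else the hardware may require:

```python
def can_forward(store_va, store_size, load_va, load_size):
    """True if the quoted store-forwarding rule allows the forward."""
    return store_va == load_va and store_size >= load_size


same_size = can_forward(0x2000, 8, 0x2000, 8)      # 8-byte store, 8-byte load
narrow_load = can_forward(0x2000, 8, 0x2000, 4)    # load of the low half
upper_half = can_forward(0x2000, 8, 0x2004, 4)     # load of the high half
wide_load = can_forward(0x2000, 4, 0x2000, 8)      # load wider than the store
```

Note the third case: a load of the upper half of a stored quadword does not qualify under this rule, because the addresses do not match exactly.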

p. 102 "Choose linear addresses for the source and destination operands
of REP MOVS/CMPS that are not an
exact multiple of 4K pages away from each other."

They describe the STLF process in more detail than I am used to seeing:

"As mentioned in the previous section, store-to-load forwarding occurs
when the store address
matches the load address. This address match is split into two stages.
In the first stage, bits 4:11 of the
store and the load addresses are matched. In addition the double word
mask of the store and load
addresses is matched. The double word mask indicates whether the
load/store pair is accessing the
same double word in a 16-byte bank. If both these parameters match, then
a store-to-load forward is
initiated. In the second stage the remaining bits 12:47 of the store and
load addresses is matched. If
the remaining bits match, then the STLF is considered as a true STLF and
is allowed to proceed.
Otherwise it is considered as a false STLF and the load is cancelled and
retried."
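
A sketch of the two-stage match as I read the quote. The encoding of the doubleword mask (one bit per 4-byte chunk touched within the 16-byte bank) and the requirement that the two masks be equal are my assumptions; the guide does not spell them out. All function names are hypothetical.

```python
def bits(value, lo, hi):
    """Extract bits lo..hi (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def dword_mask(addr, size):
    """4-bit mask of doublewords touched inside the 16-byte bank (assumed encoding)."""
    first = (addr & 0xF) // 4
    last = min(((addr & 0xF) + size - 1) // 4, 3)
    return sum(1 << i for i in range(first, last + 1))


def stlf_stage1(store_addr, store_size, load_addr, load_size):
    """Stage 1: bits 4:11 and the doubleword masks must match."""
    return (bits(store_addr, 4, 11) == bits(load_addr, 4, 11)
            and dword_mask(store_addr, store_size) == dword_mask(load_addr, load_size))


def stlf_stage2(store_addr, load_addr):
    """Stage 2: the remaining bits 12:47 must match."""
    return bits(store_addr, 12, 47) == bits(load_addr, 12, 47)


def classify(store_addr, store_size, load_addr, load_size):
    if not stlf_stage1(store_addr, store_size, load_addr, load_size):
        return "no forward"
    if stlf_stage2(store_addr, load_addr):
        return "true STLF"
    return "false STLF"  # load is cancelled and retried


same = classify(0x70001040, 8, 0x70001040, 8)
two_pages_apart = classify(0x70001040, 8, 0x70003040, 8)
next_bank = classify(0x70001040, 8, 0x70001050, 8)
```

Under this model, a store and a load whose addresses are an exact multiple of 4K apart always pass stage 1 and fail stage 2, which would explain the REP MOVS/CMPS advice on p. 102.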

p. 105 prefetching into unmapped pages can result in a significant delay.

(Hmm, I think this means that the AMD prefetcher can prefetch across 4KB
boundaries.  Does Intel do this yet? I.e. it operates on virtual, not
physical, addresses.)

I suspect this means that invalid pages are NOT placed into the TLB.

If it is true that invalid addresses are not placed into the TLB, then
every prefetch to the same page may produce a TLB miss.

GLEW OPINION: you need to cache invalid TLB entries, to constrain
prefetch and other speculation.  You may want to limit the number of
invalid TLB entries, so as to prevent invalids thrashing out valids.
Other schemes for constraining ...

p. 107 streaming stores are slower on Bulldozer!!!:

1 stream - ok

2 streams - 3X slower.  "Only" 1.5X slower by Family 15hv2

4 streams - 3X slower.  Comparable

"Using non-temporal stores but not writing out an entire cacheline may
cause performance to be up to six times slower than previous AMD
processors."

- I'm trying to figure out why for this last.  Some processors have
burst writes with byte valids - but I did not think AMD did.  Otherwise,
why would it be so much slower?

p. 115 - use WC instructions  to write code

"If normal store instructions are used to
write the code to memory, then the L2 cache lines will be in a modified
state. When the processor
eventually tries to execute the code, it will miss in the instruction
cache. Because the instruction
cache cannot contain cache lines that are in a modified state, the data
must be flushed to memory
before it can be fetched into the instruction cache. This unnecessarily
evicts possibly useful
information from the caches."

From this we learn several things:

(1) Can't have data in I$ and in M state in D$.  I.e. the O part of
MOESI does not apply to the I and D cache.

(2) flushing to memory invalidates the D$?  no flush that leaves clean
data behind.  (Heck, ARM's cache protocol can support that.)

p. 116 - "On Family 15h parts, instruction caches will invalidate
aliases to a physical page that differ in virtual
addresses bits 14:12. When one physical address is mapped by two or more
virtual addresses that
differ in this way, a performance decrease may be observed. This problem
can be observed primarily
in Linux and other Operating Systems which enable ASLR (Address Space
Layout Randomization)
by default."

Ah, using virtual address bits... Ah, yes:  " 64-Kbyte, 2-way set
associative L1 instruction cache. Each line in this cache is 64 bytes long."

Not the D$.
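
The arithmetic behind bits 14:12, as a sketch. It assumes the I$ set index is taken straight from the virtual address, which is my inference from the quoted text, not something the guide states in these words. Both function names are hypothetical.

```python
def virtual_index_bits(cache_bytes, ways, page_bytes=4096):
    """Return (hi, lo): the index bits that lie above the page offset."""
    way_bytes = cache_bytes // ways
    hi = (way_bytes - 1).bit_length() - 1   # top bit of index + line offset
    lo = (page_bytes - 1).bit_length()      # first bit above the page offset
    return hi, lo


def aliases_conflict(va1, va2):
    """True if two virtual mappings of one physical page differ in bits 14:12."""
    return ((va1 >> 12) & 0x7) != ((va2 >> 12) & 0x7)


# 64 KB, 2-way: 32 KB per way, so index plus offset span bits 14:0.
span = virtual_index_bits(64 * 1024, 2)
differ = aliases_conflict(0x7F0000001000, 0x7F0000005000)
same_color = aliases_conflict(0x1000, 0x9000)
```

So an OS that wants to avoid the penalty without disabling ASLR would need to keep shared pages at the same "color", i.e. randomize only above bit 14 for such mappings.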

Waddaya want to bet they did not have any ASLR workloads? And some
"smart" designer or architect, who had not kept up with security, wanted
to be clever by applying what they had learned in school?

"Disable ASLR. Note that this decision has security ramifications."
No shit.

I'm a security wonk.  This pisses me off.

More glass jaws

* PIC (Position Independent Code)

* Virtualization merging identical physical pages at different virtual
addresses.

p. 120 compare-branch fusion, adjacent instructions. single uop.

p. 123 - CALL next instruction;POP idiom

p. 127 - "Branches Not-Taken Preferable to Branches Taken ...
Correctly-predicted taken branches have at least one prediction-based
bubble while not-taken branches do not. In addition, taken branches
consume more branch prediction resources."

p. 132 "With the pipelined floating-point adder allowing one FADD every
cycle [still to confirm]," :-)

p. 134 "For functions that create fewer than 25 machine instructions
once inlined, it is likely that the function-call overhead is close to,
or more than, the time spent executing the function body. In these cases,
function inlining is recommended"

"Function-call overhead on the AMD Family 15h processors can be low
because calls and returns are
executed very quickly due to the use of prediction mechanisms. However,
there is still overhead..."

p. 147 - XOR reg, reg idiom - "AMD Family 15h processors are able to
avoid the false read dependency on the XOR instruction."

Sounds as if this is new with Bulldozer - which is surprising. Intel has
done this since P6.

p. 169 move between integer GPRs and XMM through memory.  STLF stalls...

p. 169 Store combining or coalescing:

"The store data path on Family 15h is 128-bits wide. Stores are written
to both the L1 Data Cache and
the L2 via the CU module into the WCC (Write Combining Cache). Writes to
the Data cache which
are unaligned in an "address" are written in two cycles. If consecutive
unaligned addressed 128-bit
loads are written they can be coalesced such that the 64-bit portions of
128-bit writes which were
unaligned can be merged and written 128-bits at a time, removing most
the stall penalties. This is
performed in the Store Coalescing Buffer (SCB). A similar operation is
performed for those writes
which go to the L2 via the CU and WCC. There, the writes are coalesced
in the Coalescing Buffer
(CB). 128-bit stores are preferable because they can be dispatched in 1
uop and they only require one
store queue entry, thus putting less pressure on resources during
execution."

p. 187 MOV optimization

"The latency of certain XMM(SSE/AVX) move instructions that provide an
input operand to a
subsequent compute instruction can be hidden in all cases. This does not
apply to 256-bit operations."

"Move instructions of this type have no latency cost regardless of
location, as the hardware now
provides the alias to each and every instance.
This hardware optimization was initially designed to work with the
MOVAPD, MOVAPS,
MOVDQA, MOVDQU, MOVUPD, and MOVUPS instructions, but works well with
their AVX
variants regardless of 128-bit versions. Other SIMD move instructions
cause a two- cycle delay in
executing the dependent compute instruction. If at all possible, every
effort should be made to use
move instructions that the processor hardware can optimize."

p. 190 16 bit FP (F16c)

I had not noticed that AMD had added 16 bit FP to Bulldozer.
Oh, actually, I had - it's just called CVT16.

Intel had added it to LRB (R.I.P.) Q: has Intel added 16 bit FP to x86
yet, other than the LRB/MIC family?

I'm just waiting for people to start doing arithmetic in FP16.  Perhaps
not in FP16 on all operands, but possibly

FP32 += FP16

FP32 = FP32*FP16 + FP16

FP32 = FP32*FP16 + FP32

etc.
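
A small emulation of what such mixed-precision operations would compute, using the Python struct module's half-precision ("e") and single-precision ("f") formats to do the rounding. The function names are hypothetical, and the multiply-add is evaluated in double precision and then rounded to FP32, which only approximates a true fused operation.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fma_fp32_fp16(a32, b16, c16):
    """FP32 = FP32*FP16 + FP16: narrow inputs, wide result."""
    return to_fp32(to_fp32(a32) * to_fp16(b16) + to_fp16(c16))


def accumulate(acc32, values):
    """FP32 += FP16, repeated: FP16 storage, FP32 accumulator."""
    for v in values:
        acc32 = to_fp32(acc32 + to_fp16(v))
    return acc32


exact = fma_fp32_fp16(2.0, 0.5, 0.25)
total = accumulate(0.0, [0.5] * 4096)
```

The accumulate case is the interesting one: a running sum of 4096 values of 0.5 reaches 2048, a magnitude at which an FP16 accumulator can no longer absorb a 0.5 increment, while the FP32 accumulator is still exact.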

p. 203

"When a modification [to code] does cross an aligned 8-byte boundary,
then care must be taken that the executor
not see an invalid combination of the original code and the new code.
There is no architectural store
instruction, including instructions that use the lock prefix, to ensure
that an executor will not see a
combination of the original code and the new code."

Lesson learned the hard way: atomicity of instruction fetch matters.

p. 210  Memory Barriers

Memory Barriers in WB Memory
In AMD64 architecture, when using writeback (WB) type memory without
streaming stores, the only
barriers that require an explicit barrier instruction are of the types
Store/Load and Store/Store. In WB
memory, all other barriers are implicit.

3 ways: SFENCE or MFENCE, locked instructions, serializing instructions
such as CPUID

Note that LFENCE is not mentioned.  Q: does AMD make LFENCE a NOP in WB?

p. 229
  "When migrating virtual machines between platforms with different
operating frequencies, there may
be problems with software that is dependent on a constant frequency TSC.
Family 15h processors
provide a new MSR called "Timestamp Counter Ratio (TscRateMsr)" which
allows the frequency of
the timestamp counter to be scaled to a fraction of the maximum
processor frequency of the host
system."

p. 244

CMPXCHG8 is one cycle slower than CMPXCHG16/32/64.  Interesting. I
wonder why?

p. 315 special bypass latencies


 