PAPER 11/20: The Microarchitecture of the Pentium 4 Processor

Guofeng

Nov 17, 2007, 12:51:52 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

http://users.ece.gatech.edu/~leehs/ECE6100/papers/P4.pdf

Yi-hsin Tseng

Nov 22, 2007, 1:32:02 PM

to asucse520-fall-07-advanc...@googlegroups.com

Hi,

Please find my critique attached.

Best regards,
Yi-hsin Tseng

Critic_Nov20_P4Microarchitecture.pdf

David.S....@asu.edu

Nov 26, 2007, 12:25:50 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

During the presentation, we got into a discussion regarding the
example on slide 12 that discusses the register renaming and register
alias table.

I modified the example to try to make it clearer. Thanks for your
feedback during the discussion.

We have the following IA32 instruction sequence:

mov EAX, 12   ; Load literal value 12 into register EAX
mov EBX, 100  ; Load literal value 100 into register EBX
add EAX, EBX  ; Add the value in EBX to the value in EAX; store the result in EAX
mov EAX, 13   ; Load the literal value 13 into register EAX
mov EBX, 200  ; Load the literal value 200 into register EBX
add EAX, EBX  ; Add the value in EBX to the value in EAX; store the result in EAX

Now, the whole point of the register renaming is to reduce false
dependencies and allow the instructions to flow into the pipeline
without stalling.

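In the above example, the second mov to EAX (mov EAX, 13) has a Write
After Write (WAW) false dependency on the earlier instructions that
write EAX (the first mov and the first add). It does not need any
value they produce; it merely writes its result to the same register.
Likewise, the second mov to EBX has a Write After Read (WAR) false
dependency on the first add, which still has to read the old EBX.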
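
For anyone who wants to check these dependencies mechanically, here is a minimal Python sketch (an illustration only, not taken from the paper or the slides). The names `Instr` and `find_hazards` are hypothetical, and the model tracks nothing but register reads and writes, treating `add EAX, EBX` as both reading and writing EAX.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    text: str               # assembly text, for display only
    dest: str               # architectural register written
    srcs: Tuple[str, ...]   # architectural registers read


# The six-instruction sequence from the post.
PROGRAM = [
    Instr("mov EAX, 12", "EAX", ()),
    Instr("mov EBX, 100", "EBX", ()),
    Instr("add EAX, EBX", "EAX", ("EAX", "EBX")),
    Instr("mov EAX, 13", "EAX", ()),
    Instr("mov EBX, 200", "EBX", ()),
    Instr("add EAX, EBX", "EAX", ("EAX", "EBX")),
]


def find_hazards(program):
    """Return (kind, earlier_index, later_index, register) tuples.

    RAW is reported against the most recent writer of a source register.
    WAW is reported against the most recent writer of the destination.
    WAR is reported against every reader since that last write.
    """
    last_writer = {}               # register -> index of latest writer
    readers = defaultdict(list)    # register -> readers since last write
    hazards = []
    for j, ins in enumerate(program):
        for src in ins.srcs:
            if src in last_writer:
                hazards.append(("RAW", last_writer[src], j, src))
        if ins.dest in last_writer:
            hazards.append(("WAW", last_writer[ins.dest], j, ins.dest))
        for r in readers[ins.dest]:
            hazards.append(("WAR", r, j, ins.dest))
        for src in ins.srcs:
            readers[src].append(j)
        last_writer[ins.dest] = j
        readers[ins.dest] = []     # the new write starts a fresh reader list
    return hazards


if __name__ == "__main__":
    for kind, i, j, reg in find_hazards(PROGRAM):
        print(f"{kind} on {reg}: '{PROGRAM[i].text}' -> '{PROGRAM[j].text}'")
```

On this sequence the sketch reports a WAW on EAX between the first add and `mov EAX, 13`, and a WAR on EBX between the first add and `mov EBX, 200`. The four RAW entries are the true dependencies, which renaming cannot remove.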

The CPU can remove the dependency by renaming the register that the
instruction references and allowing the instruction to continue into
the pipeline (become "in flight").

The P4 has a very deep pipeline. We do not want the second mov
instruction to have to stall until the previous add instruction
finishes.

The register file contains 128 registers, which is far more than the
handful of logical registers that we program with (EAX, EBX, ECX, EDX,
etc.).

Register renaming maps these logical registers onto actual registers
in the register file. This allows the register file to hold multiple
instances of EAX, EBX, etc. at the same time.

So let's walk through the instruction sequence as follows:

The CPU encounters the instructions

mov EAX, 12
mov EBX, 100

The CPU sees that registers $t1 and $t2 are not in use, so it renames
the instructions to reference them:

mov $t1, 12
mov $t2, 100

After renaming, the CPU makes an entry in the register alias table
(RAT) to note that EAX is currently mapped to actual register $t1 and
EBX currently points to $t2.

RAT[EAX] = $t1
RAT[EBX] = $t2

It needs to do this so that, if future instructions reference EAX or
EBX, the CPU will know that EAX is actually $t1 and EBX is actually
$t2.

Next the CPU encounters the instruction:

add EAX, EBX

The CPU looks in the register alias table and finds that EAX is
currently mapped to $t1 and EBX to $t2, so it rewrites the register
operands of the add to the actual registers in the register file:

add $t1, $t2

(To keep the example short, the add keeps $t1 as its destination. A
real renamer would normally allocate a fresh physical register for
the result of the add as well and update RAT[EAX] to point to it.)

Next the CPU encounters:

mov EAX, 13 ; Load the literal value 13 into EAX
mov EBX, 200 ; Load the literal value 200 into EBX

Assume that the previous three instructions are still in flight and
have not yet retired (committed), so registers $t1 and $t2 are still
allocated.

We don't want the instruction to stall because of the previous WAW
hazard.

So the CPU checks the register file and sees that $t3 and $t4 are
available. It then renames the instructions to allow them to continue
into the pipeline:

mov $t3, 13 ; Load the literal value 13 into $t3 (the new EAX)
mov $t4, 200 ; Load the literal value 200 into $t4 (the new EBX)

After renaming, the CPU updates the register alias table again to
indicate that EAX now refers to $t3 and EBX now refers to $t4.

RAT[EAX] = $t3
RAT[EBX] = $t4

Now the CPU gets the final instruction:

add EAX, EBX

The CPU again looks in the register alias table, which now maps EAX
to $t3 and EBX to $t4, and rewrites the registers in the instruction
accordingly:

add $t3, $t4

And hence, all of the instructions flow into the pipeline together
without the WAW hazards causing stalls.
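
The walkthrough above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the P4's actual allocator: the `rename` function, the tuple encoding of instructions, and the `$t` naming are hypothetical. Unlike the simplified walkthrough, the sketch allocates a fresh physical register for every destination, including the result of each add, so its numbering differs slightly ($t3 holds the first sum). Freeing registers at retirement is left out.

```python
from collections import deque

# Each instruction is (opcode, destination register, list of sources).
# Sources are register names (str) or immediate values (int).
PROGRAM = [
    ("mov", "EAX", [12]),
    ("mov", "EBX", [100]),
    ("add", "EAX", ["EAX", "EBX"]),
    ("mov", "EAX", [13]),
    ("mov", "EBX", [200]),
    ("add", "EAX", ["EAX", "EBX"]),
]


def rename(program, num_physical=128):
    """Rename architectural registers to physical ones using a RAT.

    Assumption: every destination gets a new physical register from a
    free list, and nothing retires while the sequence is being renamed.
    """
    free_list = deque(f"$t{i}" for i in range(1, num_physical + 1))
    rat = {}        # register alias table: architectural -> physical
    renamed = []
    for opcode, dest, sources in program:
        # Look up the sources BEFORE remapping the destination, so that
        # `add EAX, EBX` reads the old EAX and writes a new one.
        phys_sources = [
            rat.get(s, s) if isinstance(s, str) else s for s in sources
        ]
        # Real hardware would stall here if the free list were empty;
        # this sketch would simply raise IndexError.
        phys_dest = free_list.popleft()
        rat[dest] = phys_dest
        renamed.append((opcode, phys_dest, phys_sources))
    return renamed, rat


if __name__ == "__main__":
    renamed, rat = rename(PROGRAM)
    for opcode, dest, sources in renamed:
        print(opcode, dest, sources)
    print("final RAT:", rat)
```

Because every destination is a distinct physical register, no two renamed instructions write the same register, which is exactly why the WAW and WAR hazards disappear while the true dependencies (the adds reading what the movs wrote) remain.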

I hope this explains the point that I was trying to make a little
better.

-David

David.S....@asu.edu

Nov 26, 2007, 1:25:52 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

I also did not get a chance to talk about the "replay" logic
implemented in this chip.

Suppose register EAX contains the address of (a pointer to) some
variable in memory, and we have the following instruction sequence:

mov EBX, [EAX] ; Load the value that EAX points to into register EBX
add EBX, EBX   ; Double the value

Here is where an opportunity for speculation presents itself. When
the 'add' instruction is encountered, does the CPU let it go into
flight, or should it wait for the mov instruction to finish, since
there is a data dependency?

Due to the depth of the pipeline, if the value being retrieved from
memory is in the L1 data cache, then the mov instruction will have
delivered it by the time the add instruction actually needs it.

The processor can therefore speculate by assuming that there will be
an L1 cache hit.

Here is the logic:

1.) The CPU issues the mov instruction to start bringing in the value
from memory.
2.) The CPU encounters the 'add' instruction, which depends on the
previous 'mov' instruction to bring in the value from memory.
3.) The CPU goes ahead and issues the 'add' instruction even though
the 'mov' has not yet finished.

If the value is found in the L1 cache, it is brought in quickly and
the 'mov' instruction finishes before the 'add' reaches the execution
unit. The 'add' instruction then executes using the value that was
brought in from the L1 cache.

If the value is not found in the L1 cache, it has to be brought in
from L2 or main memory. The 'add' instruction reaches the execution
unit before the previous 'mov' instruction finishes (since L1
missed), so it executes with whatever stale value happens to be in
the register. The CPU detects that the instruction mis-speculated and
discards its result. It then re-executes ('replays') the add
instruction, but not until the 'mov' instruction has finished.
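
The two cases above can be mimicked with a toy Python model. This is a rough sketch of the idea only, not of the P4's real replay hardware: the function `execute_load_add`, its parameters, and the event strings are all hypothetical, and timing is reduced to a single hit-or-miss flag.

```python
def execute_load_add(l1_hit, loaded_value, stale_value=0):
    """Model `mov EBX, [EAX]` followed by `add EBX, EBX` with replay.

    l1_hit       -- whether the load hits in the L1 data cache
    loaded_value -- the value the load eventually delivers
    stale_value  -- whatever was in the register before the load finished

    Returns (final_result, times_add_executed, event_log).
    """
    events = [
        "issue mov",
        "issue add speculatively (assume L1 hit)",
    ]

    # The add reaches the execution unit. On a hit the loaded value is
    # already there; on a miss the add sees the stale register contents.
    operand = loaded_value if l1_hit else stale_value
    result = operand + operand
    executions = 1

    if l1_hit:
        events.append("L1 hit: add result is valid")
    else:
        events.append("L1 miss: add used a stale value, result discarded")
        events.append("mov finishes with data from L2 or main memory")
        # Replay: execute the add again, now with the correct operand.
        result = loaded_value + loaded_value
        executions += 1
        events.append("add replayed with the correct value")

    return result, executions, events


if __name__ == "__main__":
    for hit in (True, False):
        result, count, log = execute_load_add(hit, loaded_value=21, stale_value=7)
        print(f"hit={hit}: result={result}, add executed {count} time(s)")
        for line in log:
            print("   ", line)
```

In both cases the final result is the same; the only difference is that on a miss the add is executed twice, and the first (wrong) result never becomes visible.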

The benefit is obviously that if you have a lot of L1 cache hits, you
get a performance boost by not stalling dependent instructions.
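
To put a rough number on that trade-off, here is a back-of-the-envelope Python sketch. The cost model is an assumption made for illustration: holding the add back costs a fixed `wait_penalty` every time, while speculating costs `replay_penalty` only on an L1 miss. The function names and every numeric value are hypothetical and are not taken from the paper.

```python
def expected_extra_cycles(hit_rate, wait_penalty, replay_penalty):
    """Expected extra cycles per dependent instruction for two policies.

    Returns (always_wait, speculate):
      always_wait -- pay wait_penalty on every load-dependent instruction
      speculate   -- pay replay_penalty only when the L1 lookup misses
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    always_wait = float(wait_penalty)
    speculate = (1.0 - hit_rate) * replay_penalty
    return always_wait, speculate


def break_even_hit_rate(wait_penalty, replay_penalty):
    """L1 hit rate above which speculation is cheaper than waiting."""
    return 1.0 - wait_penalty / replay_penalty


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the formula.
    wait, spec = expected_extra_cycles(
        hit_rate=0.95, wait_penalty=2, replay_penalty=10
    )
    print(f"always wait: {wait:.2f} cycles, speculate: {spec:.2f} cycles")
    print(f"break-even hit rate: {break_even_hit_rate(2, 10):.2f}")
```

Under this simple model, speculation pays off whenever the L1 hit rate is above 1 - wait_penalty / replay_penalty, which is why a high hit rate matters so much for the replay scheme.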

Tao (Tony) Liu

Nov 26, 2007, 11:54:39 PM

to asucse520-fall-07-advanc...@googlegroups.com

Hi, guys,

Please find my comments on this paper attached.

Regards,
Tao Liu

P4_Paper_Comments_TAO_LIU.doc

Saleel Kudchadker

Nov 27, 2007, 3:40:23 AM

to asucse520-fall-07-advanc...@googlegroups.com

Hi,

Please find my critique attached

Saleel

Critique_P4.pdf

Yi-hsin Tseng

Nov 27, 2007, 4:03:59 AM

to asucse520-fall-07-advanc...@googlegroups.com

Hi,

I updated my critique, so I am posting it again.
Please see the attachment.

Thank you,
Yi-hsin

critique_P4_Yihsin_Tseng.pdf

Sushma Myneni

Nov 27, 2007, 3:27:09 PM

to asucse520-fall-07-advanc...@googlegroups.com

Hi All,

Check the attachment for my critique of this paper.

Thank you,

Sushma

Paper_Critique_11_20.doc

Pradnyesh Gudadhe

Nov 27, 2007, 3:57:26 PM

to asucse520-fall-07-advanc...@googlegroups.com

Please find my critique attached.
Thanks.

-Regards,
Pradnyesh

--
Please visit my UPDATED website here:
http://www.geocities.com/paddyinpilani

Critique3.pdf

Pradnyesh Gudadhe

Nov 27, 2007, 3:58:23 PM

to asucse520-fall-07-advanc...@googlegroups.com

Sorry. Wrong thread.
-Pradnyesh

ash...@gmail.com

Nov 27, 2007, 9:42:08 PM

to ASU:CSE520 FALL 07 Advanced Computer Architecture

Hi,

Here's the summary for this paper:

The paper describes the architecture of Intel's Pentium 4 processor
and the NetBurst architecture. It talks about the key features used in
P4 such as the Trace cache, double-pumped ALUs, replay execution and
an advanced branch prediction algorithm. In the end, the performance
is compared with the Pentium III processor.

Strengths:
1) An advanced form of Level-1 instruction cache, called the
Execution Trace Cache, is used. It avoids decoding the same
instructions repeatedly, has its own branch predictor, and handles
branches without much penalty. The L1 data cache also has a low
load-use latency (2 clocks for integer loads and 6 clocks for
floating point/SSE loads).

2) The double-pumped ALUs run at twice the main clock frequency, so
each one can execute two simple integer operations per main clock
cycle.

3) Includes 144 additional 128-bit SIMD instructions, called SSE2,
for use in multimedia applications.

4) A hardware prefetcher monitors data access patterns and prefetches
data automatically into the L2 cache, which improves performance.

5) The branch prediction algorithm used reduces the branch
misprediction rate by about 1/3 compared to the predictor in the P6
microarchitecture.

6) Even though decoded µops take up more space than the original
instructions, the trace cache's hit ratio remains similar to that of
an ordinary L1 instruction cache of equivalent size.

Weaknesses:
1) There is no mention of power consumption and thermal management.
With a deep pipeline, complex circuitry, and a higher operating
frequency, power consumption and heat generation are likely to be
higher.

2) Lower Instructions Per Clock (IPC) because of the deep pipeline.

3) Unfair comparison between P3 and P4 because of the different clock
rate of the CPU.

4) Since the Pentium 4 is mainly designed for multimedia applications,
it is not completely suitable for generic software applications.
Accordingly, certain applications are known to run more slowly on the
Pentium 4 than on the Pentium 3.

5) Because the ALU runs at a higher clock frequency than other
components, it causes hot spots in areas of high density on the chip.
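
Weaknesses 2 and 3 both come down to the relation throughput = IPC x clock frequency. Here is a small Python sketch of that arithmetic. The function names are hypothetical, the clock values are picked to resemble a 1 GHz versus 1.5 GHz comparison, and the IPC values are invented for illustration; none of these are measurements.

```python
def instructions_per_second(ipc, clock_hz):
    """Throughput as instructions per clock times clocks per second."""
    return ipc * clock_hz


def speedup(ipc_new, clock_new_hz, ipc_old, clock_old_hz):
    """Throughput ratio of a newer design over an older one."""
    new = instructions_per_second(ipc_new, clock_new_hz)
    old = instructions_per_second(ipc_old, clock_old_hz)
    return new / old


if __name__ == "__main__":
    # Hypothetical: the deeper pipeline loses 20% IPC but clocks 50% higher.
    overall = speedup(ipc_new=0.8, clock_new_hz=1.5e9,
                      ipc_old=1.0, clock_old_hz=1.0e9)
    same_clock = speedup(ipc_new=0.8, clock_new_hz=1.0e9,
                         ipc_old=1.0, clock_old_hz=1.0e9)
    print(f"at shipping clocks: {overall:.2f}x")
    print(f"at equal clocks:    {same_clock:.2f}x")
```

With these made-up numbers the deeper pipeline is faster at its own clock rate but slower at an equal clock rate, which is why it matters whether a comparison is made per clock or at shipping frequencies.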

Ashay
