
Chief Engineer for development of XBOX 360 CPU speaks.


multi-core

Nov 5, 2005, 6:50:54 PM
http://gametomorrow.com/blog/index.php/2005/10/27/xbox-360-cpu-details-described-at-mpr-fall-processor-forum/


XBOX 360 CPU Details Described at MPR Fall Processor Forum

Blogged under XBox by Jeff Brown on Thursday 27 October 2005 at 3:33 am

I was the Chief Engineer for the development of the XBOX 360 CPU chip
and Tuesday, October 25 I presented details about the chip and how we
developed it to Microprocessor Report's Fall Processor Forum.
Here's some of what I presented.

XBOX 360 CPU Project
Back in November of 2003 Microsoft and IBM made the following
announcement.

REDMOND, Wash., and EAST FISHKILL, N.Y. - Nov. 3, 2003 - Microsoft
Corp. today announced that it has entered into a semiconductor
technology agreement with IBM Corp. Under the agreement, Microsoft has
licensed leading-edge semiconductor processor technology from IBM for
use in future Xbox® products and services to be announced at a later
date.

That announcement was really about the development and manufacturing of
the CPU chip for the XBOX 360 console.

Microsoft's engagement with IBM was through IBM's Engineering and
Technology Services Division. As you may be aware, IBM's Services
offerings are a significant portion of IBM's total revenue. E&TS is
one of the newest parts of that business.

Through E&TS, Microsoft was able to take advantage of the significant
investment in Research and Development that IBM makes for system
design, component design, and semiconductor process technology.

Microsoft brought their gaming domain knowledge and software
development experience and worked with E&TS to apply our capabilities
to the task of developing a custom processor solution. We built on
IBM's extensive portfolio of established designs and research
projects to develop a personalized system architecture to match the
XBOX 360 system vision and engineered it for use in a consumer product.

Chip Overview
The CPU chip contains a 3-way symmetric multi-processor running at 3.2
GHz. The 3 processors share a 1 MB L2 cache and a Front Side Bus which
connects the CPU chip to the ATI graphics chip. The Front Side Bus has
a peak bandwidth of 21.6 GByte/sec.

The chip also includes a significant portion of support logic that
provides test, Power On Reset control, and debug and trace functions.
There are eFuses used for array redundancy to improve manufacturing
yield. We also used eFuses for configurable voltage control and
parametric adjustment in the analog units.

The chip IOs provide the following:
· Front Side Bus
· Debug access to trace array data, performance monitor counters, and
critical control and timing information,
· JTAG,
· PowerOn Status condition codes,
· the voltage identifier for the variable voltage regulator which
supplies the CPU chip,
· EEPROM attach to hold configuration control data if it turned out
to be necessary.

Power PC Core
At 3.2 GHz this is the highest frequency Power PC architecture core IBM
is shipping.
The CPU core is a dual-issue, in-order execution micro-architecture with
simultaneous multi-threading and support facilities for 2 threads.
Because dynamic power consumption is key, we implemented extensive clock
gating to shut down pipelines until instructions are active.

The L1 icache is a 32 KByte cache with parity error checking. It is a
2-way set associative cache with 128B lines. 1st level translation for
instruction addresses is done using a 64 entry 2-way set associative
effective to real address translation cache (ERAT).
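
As a quick check of the icache figures above, the number of sets in a
set-associative cache is size / (ways x line size). The sketch below is
illustrative arithmetic only; the function and variable names are made
up for this example and do not come from the presentation.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size must be a multiple of ways * line size")
    return sets

# L1 icache figures quoted above: 32 KByte, 2-way, 128-byte lines.
icache_sets = cache_sets(32 * 1024, ways=2, line_bytes=128)

# Address bits that select a byte within a line and a set within the cache.
offset_bits = (128).bit_length() - 1
index_bits = icache_sets.bit_length() - 1

# A 64-entry, 2-way translation cache holds 64 / 2 sets of entries.
erat_sets = 64 // 2

print(icache_sets, offset_bits, index_bits, erat_sets)
```

With these numbers the icache has 128 sets, so 7 address bits pick the
byte within a line and the next 7 pick the set.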

The 2 issued instructions can go to one of 5 execution pipes: Branch
(which is really part of the Instruction unit), Load/Store, Fixed Point,
Floating Point, and VMX. Difficult instructions are implemented via
microcode. At dispatch they are cracked and converted into multiple
micro-ops.

The branch unit includes a 4 KByte, 2-way set-associative Branch
History Table per thread.

The Fixed Point Pipe actually has two units. One handles the simple
operations (add/sub, cmp, logical ops, and rotate). The other handles
the complex ops like multiply/divide.

The Load/Store pipe handles access to the L1 Data cache and the storage
hierarchy.
Like the L1 icache, the L1 dcache is a 32 KByte cache with parity error
checking. However, it is 4-way set associative. It is store-through and
provides non-blocking access, so a cache miss does not hold up a
subsequent hit.

1st level Data address translation is handled by a 64 entry 2-way
associative ERAT. 2nd level translation for both data and instructions
is handled by a 1K 4-way associative TLB which can be software as well
as hardware managed.

VMX 128
We developed a Microsoft-unique implementation of VMX called VMX128,
which focused on improving graphics, game physics, and artificial
intelligence.

Power management within the FPU / VMX128 units is especially valuable
as it is rare that all three cores would be running threads with active
numeric computation.

We implemented a Delayed Execution Issue Queue which reduces the
effective load latency to 2 cycles versus 8-10 cycles without it. There
are separate load target buffers for the FPU and VMX128 units that
essentially enable out-of-order FP/VMX execution relative to loads and
stores.

We made a number of architectural changes to the VMX unit when we
created VMX128. We extended the number of Vector Registers from 32 to
128. All 128 registers are directly addressable, and the original 32
registers are mapped to the first 32 entries of the 128-entry vector
register file. We also added a number of instructions:

· floating-point dot-product instructions supporting 3-vectors and
4-vectors
· Permute-class instructions for rotate and insert operations
· Pack / unpack instructions for converting Direct3D data types
to/from single-precision FP format
· storage access instructions to improve access to misaligned data
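
To make the first item in the list concrete, here is a minimal sketch of
the arithmetic a 3-vector or 4-vector dot product computes on 4-element
operands. It is plain Python, not a model of the VMX128 instructions:
the function name, the `lanes` parameter, and the sample values are all
hypothetical, and encoding, register layout, and rounding behavior are
not represented.

```python
from typing import Sequence

def dot(a: Sequence[float], b: Sequence[float], lanes: int) -> float:
    """Dot product over the first `lanes` elements of two 4-element vectors."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected 4-element vectors")
    if lanes not in (3, 4):
        raise ValueError("lanes must be 3 or 4")
    return sum(x * y for x, y in zip(a[:lanes], b[:lanes]))

# Hypothetical sample operands (x, y, z, w).
normal = (0.0, 1.0, 0.0, 5.0)
light = (0.0, 0.5, 0.0, 2.0)

dot3 = dot(normal, light, lanes=3)  # ignores the fourth element
dot4 = dot(normal, light, lanes=4)  # uses all four elements
print(dot3, dot4)
```

The 3-vector form is the common case for lighting and physics math where
the fourth element is padding or an unrelated value.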

Finally, we maintained binary compatibility with a subset of the
original PowerPC ISA.

Shared L2
The shared 1 MB L2, which supports the three CPUs, is split into two
portions. One part connects the CPUs with the different dataflow queues
and runs at the processor frequency of 3.2 GHz. The rest of the L2,
including the data arrays and the directory, runs at ½ the processor
frequency.

Commands from the 3 cores are queued and then arbitrated into an L2
directory control unit for processing. In order to improve caching
performance there are two copies of the directory, allowing
simultaneous core access and IO snoops. The directories have parity
based error detection.

Cacheable and Cache Inhibited store operations are processed through
different pipelines. The cacheable store pipe includes 8 store
gathering buffers per core. These 8 line buffers are non-sequential to
improve performance. The non-cacheable store pipe includes 4 store
gathering buffers per core. These 4 line buffers are sequential and
simplify ordering for non-cacheable ops.
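
The general idea of store gathering can be sketched as a toy model:
stores that fall in the same cache line are merged into one line buffer
so that a single write reaches the cache. Everything specific here is an
assumption made for illustration, not a description of the chip: the
128-byte line size (the article quotes it only for the L1 icache), the
oldest-first write-out when all buffers are busy, and the class and
method names.

```python
from collections import OrderedDict

LINE_BYTES = 128  # assumption: line size quoted for the L1 icache

class StoreGatherer:
    """Toy model: merge stores per line; write out the oldest line when full."""

    def __init__(self, num_buffers: int) -> None:
        self.num_buffers = num_buffers
        self.buffers = OrderedDict()  # line number -> {byte offset: value}
        self.flushed = []             # (line number, bytes) written to the cache

    def store(self, address: int, value: int) -> None:
        line, offset = divmod(address, LINE_BYTES)
        if line not in self.buffers:
            if len(self.buffers) == self.num_buffers:
                # Assumed policy: free the oldest buffer to make room.
                self.flushed.append(self.buffers.popitem(last=False))
            self.buffers[line] = {}
        self.buffers[line][offset] = value & 0xFF

gatherer = StoreGatherer(num_buffers=8)
for i in range(16):
    gatherer.store(0x1000 + i, i)      # 16 byte stores, all in one line
lines_in_use = len(gatherer.buffers)
bytes_gathered = len(gatherer.buffers[0x1000 // LINE_BYTES])
print(lines_in_use, bytes_gathered)
```

In this model the sixteen byte stores occupy a single buffer, which is
the bandwidth saving that gathering is meant to provide.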

The L2 Data array includes Single Bit Correct / Double Bit Detect ECC.
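
The article does not say which code or word width the L2 uses, so the
following is only a textbook illustration of the single-bit-correct,
double-bit-detect principle: an extended Hamming (8,4) code protecting
4 data bits. The function names and bit layout are choices made for
this sketch.

```python
from typing import List, Optional, Tuple

def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword."""
    code = [0] * 8  # index 0: overall parity; 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code

def decode(received: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); status is 'ok', 'corrected', or 'double'."""
    code = list(received)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position      # XOR of set positions locates one flip
    odd_parity = sum(code) % 2 == 1
    if odd_parity:
        code[syndrome] ^= 1           # single error (syndrome 0: parity bit)
        status = "corrected"
    elif syndrome:
        return "double", None         # even parity, bad syndrome: two flips
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

A single flipped bit is located by the syndrome and repaired, while two
flipped bits leave the overall parity even and are reported rather than
miscorrected.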

Included in the L2 Cache architecture are several features to support
high-bandwidth data streams. To improve read streaming bandwidth we
focused on two things.
1. We added an Extended Data Cache Block Touch instruction which allows
a data prefetch to bypass the L2 and go directly into the L1. This
significantly reduces the L2 thrashing which can be an issue for
prefetching with smaller L2s.
2. We also implemented an aggressive hit under miss capability in the
core so that each core can have up to 8 loads outstanding.

To improve write streaming bandwidth we focused on three features:
1. Within the Core the L1s are write through so writes do not allocate
a line into the L1.
2. Within the L2, which is 8-way set associative, we provide a
configurable L2 set locking capability that ensures that streaming
through the locked set does not thrash the rest of the cache.
3. Finally, to support procedural geometry, modified data within the L2
can be read by the GPU without forcing a store to memory which could
cause a change of ownership or eviction of the line. One of the key
design objectives was high sustained bandwidth for this GPU read
operation.

Front Side Bus
The Front Side Bus Architecture developed by IBM was fully customized
for the XBOX 360 gaming platform in order to meet throughput and
functional requirements. The Link Architecture utilized a specialized
packet structure with automatic hardware managed flow control, error
recovery, link training, and link management.

IBM took an end-to-end approach to the Front Side Bus architecture and
development. This includes design, verification, and test owned by IBM
with half of the link existing within ATI's GPU. In fact, a common
VHDL description, designed by IBM, is instantiated in the two chips
even though both chips are built with very different methodology,
technology, frequency, and data widths.

The transaction layer provides a common functional interface to the two
chips. It manages the Link Layer protocol for reliable packet delivery.
It also performs command reordering and manages the two virtual
channels. The two virtual channels are used for request and response
and were architected primarily for deadlock avoidance but they also
allow configurable performance by setting channel priority.

The Link Layer provides link training, error detection and
retransmission, as well as flow control. We architected a beefed up
soft error recovery mechanism to support the use of lower cost
manufacturing components. In addition, because the memory containing
the boot program resides across the link at the GPU or below, the link
initialization must be bulletproof without software intervention.

The front side bus physical layer is structured as two unidirectional
links, each capable of transmitting 10.8 GByte/sec. Each link is made
of two single byte lanes. Each lane has one clock. The links are source
synchronous so that the receive clock is sent with the data.
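
The bandwidth figures fit together as simple arithmetic, sketched below.
The one assumption is that each lane moves one byte per transfer at the
5.4 GHz rate quoted for the front side bus in the Verification and
Bringup section; the constant names are invented for this example.

```python
LINKS = 2                 # one unidirectional link in each direction
LANES_PER_LINK = 2        # "two single byte lanes"
BYTES_PER_LANE = 1
TRANSFER_RATE_GHZ = 5.4   # assumed: one transfer per lane per cycle

# GByte/sec in one direction, then summed over both directions.
per_link_gbyte_s = LANES_PER_LINK * BYTES_PER_LANE * TRANSFER_RATE_GHZ
peak_gbyte_s = LINKS * per_link_gbyte_s

print(round(per_link_gbyte_s, 1), round(peak_gbyte_s, 1))
```

Under that assumption each direction carries 10.8 GByte/sec, and the two
directions together give the 21.6 GByte/sec peak quoted in the Chip
Overview.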

The most demanding portions of the PHY design are the analog
transmitters and receivers. The analog components are implemented using
Current Mode Logic which supports the very low jitter and high noise
tolerance required.

Termination on the link to improve link signaling quality is controlled
dynamically at link training. Low-tolerance resistors are dynamically
switched in and out to adjust the termination to 50 ohms.

The physical link specification included the receiver and transmitter
performance, the chip package, and the board parametrics, layout, and
wiring constraints. The specification was created by IBM and used by
Microsoft to design the system board.

CPU Chip Package
The design of the CPU package presented a significant challenge due to
the combination of the power environment, the high frequency operation
of the CPU, the Front Side Bus frequency, and high volume low cost
system card and chip package goals.

The custom package design is a 2-2-2 Flip Chip PBGA which is 31mm by
31mm and supports the 2s,2p system card.

In order to operate the Front Side Bus reliably, the package had to
support aggressive targets for differential signal attenuation between
the package ball and the C4, loss due to reflection within the package,
and crosstalk between adjacent signal pairs.

The package was designed to provide power distribution to the CPU chip
with no greater than 80mV droop at the circuit.

In the end the PHY design, FSB architecture, the link specification,
and the package design all worked together to close on a solution for
the system that could be manufactured.

Test and Debug Features
No serious CPU chip of this complexity would be complete without
comprehensive test and debug features.

The XBOX 360 CPU includes support logic for Array and Logic Built-In
Self-Test (BIST). AC BIST operates at full functional frequency, which
allows for maximum defect coverage including marginally slow circuits.
The Analog PHY is functionally tested by an internal wrap test called
PING BIST. This test also operates at full PHY frequency.

The chip includes internal trace arrays which allow thousands of key
internal signals to be traced. Extensive pattern matching for trigger
conditions provides an extremely useful logic analyzer. The external
debug bus provides a way to collect extended traces beyond what can be
held within the on-board trace arrays.

Finally, to support performance tuning of the gaming applications and
environment, we implemented a set of performance counters which can be
set to collect event counts. The set of performance events, which number
in the hundreds, was defined during chip development to support
Microsoft's performance team.

All three of these key features were utilized during our accelerated
hardware and system bring-up and validation. In fact, they were among
the key enablers for success.

Another key enabler was the extensive verification effort prior to the
release of design data to manufacturing.

Verification and Bringup
Success on this program required that the CPU Chip be right the first
time. Pass 1 hardware needed to be fully functional. That means the
front side bus had to run at 5.4 GHz, the CPU had to run at 3.2 GHz, and
caches had to be enabled and operational.

One week after the CPU powered on in a bring-up system, a demo game was
running with full chip functionality. So how did we do it?

Our strategy was pretty simple.
· Do as much as we can in parallel to make the most progress on a
short schedule
· Structure things hierarchically so bugs are found in environments
where they can be diagnosed and fixed the quickest.

We took advantage of different methodologies at the unit and subsystem
levels and then created a unified methodology at the chip and system
levels.

At each level we had quality measurement standards based on coverage,
test suite, simulation cycles, and bug rate. Together they provided
both a bottom-up events coverage view and a top-down architecture view
of the coverage.

We held extensive reviews, leveraging the experience and knowledge of
IBM corporate verification experts. We even went so far as to take the
advice of the reviewers and made changes to our plans, our staffing,
and our tools!

We brought to bear the best of IBM's knowledge on verification by
tapping into our Research and Development teams. This included using
established uni- and multi-processor test suites and intelligent
randomized test generation. We also used formal verification tools and
methods to prove key parts of the chip architecture were correct.

We validated many system level operations via co-simulation between
Microsoft and IBM. This way we were confident the system power on
reset, system level coherency, boot ROM code, and key parts of the
kernel would operate correctly during bring-up.

Bring-up itself was done in three locations where each focused on
different parts of the overall tasks. In many ways it was a truly joint
effort with engineers from multiple companies working together and
bringing the best tools to solve the problems.

Success
We developed an XBOX 360 unique CPU chip, engineered and optimized for
the specific product constraints. We went from 1st silicon to volume
production in 8 months, and if you get in line early enough you might
be able to buy one November 22.
