Divide by constant

Spike

Feb 9, 2004, 2:01:55 PM
I would like to divide x by 4500 (in mathematical terms: x/4500), but how
can I do this the fastest way?
(Is there another way than using DIV?)

TIA!

//Spike

Matt Taylor

Feb 9, 2004, 2:58:26 PM
"Spike" <im_a...@hotmail.com> wrote in message
news:StqVb.81993$dP1.2...@newsc.telia.net...

Refer to the K7 Optimization Manual. You can replace a divide-by-constant
with a multiply. Modulo is not as easy, but you can get it with 2 multiplies
by using a % b == a - ((a / b) * b).

-Matt
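
As a concrete sketch of that identity (an editorial illustration, not code
from Matt's post; it borrows the reciprocal constant for 4500 that Terje
derives just below):

    ; x % 4500 via a % b == a - ((a / b) * b), assuming x in EAX.
    ; Clobbers EDX and ECX; the remainder ends up in ECX.
    mov  ecx, eax         ; save x
    mov  edx, 3909374677  ; ceil(2^44 / 4500)
    mul  edx              ; EDX:EAX = x * reciprocal
    shr  edx, 12          ; EDX = x / 4500, truncated
    imul edx, edx, 4500   ; EDX = (x / 4500) * 4500
    sub  ecx, edx         ; ECX = x % 4500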

Terje Mathisen

Feb 9, 2004, 3:48:47 PM
Matt Taylor wrote:

Matt is right.

Years ago, I rediscovered and proved an algorithm to do this for all
possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).

If you look for Agner Fog's optimization guide (Google!), you'll find
the writeup Agner did of the algorithm.

Terje

PS. The short version is like this:

Calculate 1/4500 with at least 34 bits of precision, then round the
result up to have 33 bits.

If the final bit is zero, you're in luck, because you can then save a
cycle or two.
...
2^44/4500 = 3909374676.53689

So, to divide by 4500 you can use this instead:

; Assume number to be divided in EAX:

mov edx,3909374677
mul edx
shr edx,12

; At this point EDX contains the correctly truncated result of
; dividing EAX by 4500

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Phil Carmody

Feb 9, 2004, 6:58:21 PM
Terje Mathisen <terje.m...@hda.hydro.com> writes:

> Matt Taylor wrote:
>
> > "Spike" <im_a...@hotmail.com> wrote in message
> > news:StqVb.81993$dP1.2...@newsc.telia.net...
> >
> >> I would like to divide x by 4500 (in mathematical terms: x/4500),
> >> but how can I do this the fastest way?
> >> (Is there another way than using DIV?)
> >
> > Refer to the K7 Optimization Manual. You can replace a
> > divide-by-constant with a multiply. Modulo is not as easy, but you
> > can get it with 2 multiplies by using a % b == a - ((a / b) * b).
>
> Matt is right.
>
> Years ago, I rediscovered and proved an algorithm to do this for all
> possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).
>
> If you look for Agner Fog's optimization guide (Google!), you'll find
> the writeup Agner did of the algorithm.

There's an in-depth Dr. Dobb's article on this matter, dating from
1992-1993 I think, which deals with the 16/16->16 case. I don't believe it
covered the theory particularly deeply, but it certainly had a very
easy learning curve.

Phil
--
Unpatched IE vulnerability: Extended HTML Form Attack
Description: Cross Site Scripting through non-HTTP ports,
stealing cookies, etc.
Published: February 6th 2002
Reference: http://eyeonsecurity.org/advisories/multple-web-browsers-vulnerable-to-extended-form-attack.htm

Andrew Kennedy

Feb 9, 2004, 7:42:43 PM

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>Matt Taylor wrote:
>
>> "Spike" <im_a...@hotmail.com> wrote in message
>> news:StqVb.81993$dP1.2...@newsc.telia.net...
>>
>>> I would like to divide x by 4500 (in mathematical terms: x/4500),
>>> but how can I do this the fastest way?
>>> (Is there another way than using DIV?)
>>
>> Refer to the K7 Optimization Manual. You can replace a divide-by-constant
>> with a multiply. Modulo is not as easy, but you can get it with 2
>> multiplies by using a % b == a - ((a / b) * b).
>
>Matt is right.
>
>Years ago, I rediscovered and proved an algorithm to do this for all
>possible values on an x86 (i.e. any 32-bit/32-bit -> 32-bit division).
>
>If you look for Agner Fog's optimization guide (Google!), you'll find
>the writeup Agner did of the algorithm.
>
>Terje
>
>PS. The short version is like this:
>
>Calculate 1/4500 with at least 34 bits of precision, then round the
>result up to have 33 bits.
>
>If the final bit is zero, you're in luck, because you can then save a
>cycle or two.
>....
>2^44/4500 = 3909374676.53689

Is that 2 to the 44th power?
And does it work with any number value?

Thanks.

Terje Mathisen

Feb 10, 2004, 8:43:02 AM
Andrew Kennedy wrote:

Yes, and maybe: the power needs to be 32 plus the exponent of the largest
power of two that's less than the divisor (2^12 = 4096 in the case of
4500, hence 2^44).

Read the Agner Fog reference!

Terje

Tim Roberts

Feb 10, 2004, 1:36:58 PM
"Andrew Kennedy" <andrewk...@LOGev1.net> wrote:
>>
>>PS. The short version is like this:
>>
>>Calculate 1/4500 with at least 34 bits of precision, then round the
>>result up to have 33 bits.
>>
>>If the final bit is zero, you're in luck, because you can then save a
>>cycle or two.
>>....
>>2^44/4500 = 3909374676.53689
>
>Is that 2 to the 44th power?

Yes, caret (^) is the C exponentiation operator.

>And does it work with any number value?

Any value up to 32 bits. This might not be obvious from the code:

>>So, to divide by 4500 you can use this instead:
>>
>>; Assume number to be divided in EAX:
>>
>> mov edx,3909374677
>> mul edx
>> shr edx,12
>>
>>; At this point EDX contains the correctly truncated result of
>>; dividing EAX by 4500

The key to understanding this code is to realize that this instruction:
mul edx
multiplies eax by edx and produces a 64-bit result in edx:eax. That 64-bit
result has your answer times 2 to the 44th power. We ignore eax, which
throws away the lowest 32 bits, and then shift edx right by 12 more, which
produces the correct result.
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
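
(One worked value, added as an editorial illustration -- the arithmetic is
mine, not Tim's. Take EAX = 1000000:

    1000000 * 3909374677 = 3909374677000000   ; 64-bit product in EDX:EAX
    3909374677000000 / 2^32 = 910222          ; keep EDX, drop EAX
    910222 >> 12 = 222                        ; the final SHR
    1000000 / 4500 = 222.22...                ; truncates to 222, matching)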

Nuutti Kuosa

Feb 10, 2004, 1:37:05 PM
Here is one example:
http://www.azillionmonkeys.com/qed/adiv.html

Yours,

Nuutti

Spike

Feb 10, 2004, 1:37:08 PM
How can multiply be faster than divide? In my opinion they should run at
the same speed, since repeated addition would probably take the same time
to complete as repeated subtraction.

How fast is multiply compared to divide?

//Spike

"Terje Mathisen" <terje.m...@hda.hydro.com> skrev i meddelandet
news:c08r95$9vb$1...@osl016lin.hda.hydro.com...

Andrew Kennedy

Feb 10, 2004, 1:38:40 PM

Terje Mathisen <terje.m...@hda.hydro.com> wrote:

>Andrew Kennedy wrote:
>> Is that 2 to the 44th power?
>> And does it work with any number value?
>
>Yes and maybe: the power needs to be 32 + the largest power of two
>that's less than the divisor (4096 in the case of 4500).
>
>Read the Agner Fog reference!
>
>Terje

I am currently reading it and getting a lot of good optimization
ideas. There is a lot of material, but it's given me an incentive
to go on to using 32 bit regs, alignment, and optimization
based on processor present.

Thanks.

Andrew Kennedy

Feb 10, 2004, 3:45:40 PM

Tim Roberts <ti...@probo.com> wrote:
>"Andrew Kennedy" <andrewk...@LOGev1.net> wrote:
>>>
>>>PS. The short version is like this:
>>>
>>>Calculate 1/4500 with at least 34 bits of precision, then round the
>>>result up to have 33 bits.
>>>
>>>If the final bit is zero, you're in luck, because you can then save a

>>>cycle or two.
>>>....
>>>2^44/4500 = 3909374676.53689
>>
>>Is that 2 to the 44th power?
>
>Yes, caret (^) is the C exponentiation operator.
>
>>And does it work with any number value?
>
>Any value up to 32 bits. This might not be obvious from the code:

Is that the same as any number that will fit into a quadword?

>>>So, to divide by 4500 you can use this instead:
>>>
>>>; Assume number to be divided in EAX:
>>>
>>> mov edx,3909374677
>>> mul edx
>>> shr edx,12
>>>
>>>; At this point EDX contains the correctly truncated result of
>>>; dividing EAX by 4500
>
>The key to understanding this code is to realize that this instruction:
> mul edx
>multiplies eax by edx and produces a 64-bit result in edx:eax. That 64-bit
>result has your answer times 2 to the 44th power. We ignore eax, which
>throws away the lowest 32 bits, and then shift edx right by 12 more, which
>produces the correct result.

Thanks, your explanation made a lot of sense.


Kevin G. Rhoads

Feb 10, 2004, 4:54:00 PM
>How can multiply be faster than divide? In my opinion they should run at
>the same speed, since repeated addition would probably take the same time
>to complete as repeated subtraction.

IF repeated add/sub were used, there would be something to this argument.
But basic multiply hardware does something like the multiple-digit
multiplying taught in grade school (only base-2 instead of base-10), and
similarly basic divide hardware does the equivalent of long division.

Of course, modern processors don't even do the basic forms; there are all
sorts of speed-up tricks that can be, and are, applied.

But for multiply, the sub-parts can be computed in parallel and then all
added, whereas for long division each successive sub-part depends on the
previous one, and while tricks can get around much of that inherent
serialization, not all of it is avoidable.

In the end, hardware-assisted divide takes longer, on average, than
hardware-assisted multiply. But not by the ratio it used to, nor by as
much as comparing long division to multiplication would suggest.

And if you need some extended precision not supported by the hardware, it
gets real bad, really fast.

Matt Taylor

Feb 10, 2004, 5:26:59 PM
"Spike" <im_a...@hotmail.com> wrote in message
news:XQ_Vb.82172$dP1.2...@newsc.telia.net...

> How can multiply be faster than divide? In my opinion they should run at
> the same speed, since repeated addition would probably take the same time
> to complete as repeated subtraction.
>
> How fast is multiply compared to divide?
<snip>

Check the manuals. Multiply is usually <10 cycles, but it depends on the
width. A 32x32->32 multiply is faster than a 32x32->64 multiply. OTOH,
Opteron can do a 32x32->64 multiply in 3 cycles. Athlon takes 6 cycles.
Pentium-4 takes 15-18 cycles, but that is because it has no integer multiply
unit! Prescott will be much faster in that area. For comparison, the
original Pentium took 10 cycles.

Divide on all these chips takes about 40 cycles.

The difference is that multiplication can be run in parallel. For a*b, you
can set it up as a series of shifts/masks/adds that can be highly
parallelized. For example, 15*5 = 0*(15 << 3) + 1*(15 << 2) + 0*(15 << 1) +
1*(15 << 0). You can obviously compute this in log2(32) time, but it's also
possible to fully expand the expression into boolean functions and compute
it in 1 cycle. It depends on how many gates the chip designer wants to
spend.
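
As an illustration (this snippet is mine, not from Matt's post), here is
the 15*5 expansion above written out as shifts and adds:

    mov eax, 15
    mov edx, eax
    shl edx, 2       ; 15 << 2, for bit 2 of the multiplier 5 (= 0101b)
    add eax, edx     ; (15 << 0) + (15 << 2) = 75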

Division is not so lucky; it has to be done as a series of sequential
shifts/subtracts. The algorithm I was given uses an iterative
double-precision shift (shld) by 1 place. It performs long division no
differently than how children are taught to do it by hand in grade school.
At the end, the modulo and quotient remain in the two registers used.
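
A minimal sketch of such a loop (my illustration, not the exact routine
Matt describes), doing restoring long division of EDX:EAX by EBX, one
quotient bit per SHLD step; it assumes EDX < EBX and EBX <= 2^31 so the
shifted remainder always fits in 32 bits:

    mov ecx, 32
next:
    shld edx, eax, 1    ; shift the remainder:dividend pair left one bit
    shl eax, 1
    cmp edx, ebx
    jb skip
    sub edx, ebx        ; the divisor fits: subtract it...
    inc eax             ; ...and set this quotient bit
skip:
    dec ecx
    jnz next
    ; EAX = quotient, EDX = remainder (the modulo)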

-Matt

Robert Wessel

Feb 10, 2004, 7:40:08 PM
"Spike" <im_a...@hotmail.com> wrote in message news:<XQ_Vb.82172$dP1.2...@newsc.telia.net>...

> How can multiply be faster than divide? In my opinion they should run at
> the same speed, since repeated addition would probably take the same time
> to complete as repeated subtraction.
>
> How fast is multiply compared to divide?


It's much easier to implement fast multiplication than division. On a
P4, for example, FP multiplication (FMUL) has a latency of 7 clocks,
and FDIV 23-48, depending on precision. (I)MUL is about 14, (I)DIV
56-70.

The multiplication and division schemes we learned in school both take
O(n**2), but you can do much better. Even if you stick with the grade
school scheme (which is common in hardware), you can trivially carry
out many steps of a multiplication in parallel and complete the
operation in O(n log n), which you *cannot* do with division, which
remains O(n**2). In hardware, you can do one "n" in parallel for both
multiplication and division, which reduces the total number of steps
to log(n) for multiplication and n for division.

Further, you can implement multiplication in O(n log n log log n)
with an FFT-based scheme. The much simpler (to understand and
implement) Karatsuba method gets you about n**1.6. Usually it's
fastest to divide by multiplying by 1/x, after computing 1/x with an
iterative Newtonian approximation. So you can divide in something
like O(log n) times the cost of a multiplication.
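
(For reference, an editorial aside rather than Robert's words: the Newton
iteration for a reciprocal is r' = r * (2 - d*r), and each step roughly
doubles the number of correct digits, which is where the O(log n)
multiplication count comes from. With d = 4500 and a rough first guess
r0 = 0.0002:

    r1 = 0.0002  * (2 - 4500*0.0002)  = 0.0002  * 1.10 = 0.00022
    r2 = 0.00022 * (2 - 4500*0.00022) = 0.00022 * 1.01 = 0.0002222

against the true value 1/4500 = 0.000222...)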

Bob Masta

Feb 11, 2004, 8:06:34 AM

Instead of shifting EDX down by 12 bits after the MUL, why not
just shift the original multiplier constant by that amount and
skip the SHR completely?


Bob Masta
dqatechATdaqartaDOTcom

D A Q A R T A
Data AcQuisition And Real-Time Analysis
www.daqarta.com

Terje Mathisen

Feb 11, 2004, 11:43:09 AM
Bob Masta wrote:

Good question!

Because that loses information, i.e. bits, that are needed to make sure
that the reciprocal multiplication works for _all_ possible inputs.

If you know that your input values are bounded to less than the full
2^32 unsigned range, then you can also get away with a shorter reciprocal.

Terje

Matt Taylor

Feb 11, 2004, 12:01:11 PM
"Bob Masta" <NoS...@daqarta.com> wrote in message
news:402a244...@news.itd.umich.edu...

> On Mon, 9 Feb 2004 20:48:47 +0000 (UTC), Terje Mathisen
> <terje.m...@hda.hydro.com> wrote:
>
<snip>

> >So, to divide by 4500 you can use this instead:
> >
> >; Assume number to be divided in EAX:
> >
> > mov edx,3909374677
> > mul edx
> > shr edx,12
> >
> >; At this point EDX contains the correctly truncated result of
> >; dividing EAX by 4500
> >
>
> Instead of shifting EDX down by 12 bits after the MUL, why not
> just shift the original multiplier constant by that amount and
> skip the SHR completely?

Try it with Terje's example and see:

(4500 * 3909374677) / 2^44 = 17592186046500 / 2^44 = 1
4500 * (3909374677 / 2^44) = 4500 * 0 = 0

To make it painfully obvious, try this, too:
(4500 * (3909374677 / 2)) / 2^43 = (4500 * 1954687338) / 2^43 =
8796093021000 / 2^43 = 0

The divisor is typically odd, so you can't reduce the multiplier, or you
will clip precision and get the wrong answer for some numbers. You can
normalize an even divisor so that bit 0 is 1, but I don't think it ever
reduces further than that.

-Matt


Grumble

Feb 11, 2004, 2:29:34 PM
Matt Taylor wrote:

> Pentium-4 takes 15-18 cycles, but that is because it has no integer
> multiply unit! Prescott will be much faster in that area.

Matt,

Be careful, Prescott *is* a Pentium 4

<g>

Emlyn Peter Corrin

Feb 11, 2004, 2:29:39 PM
Tim Roberts wrote:
> "Andrew Kennedy" <andrewk...@LOGev1.net> wrote:
>>> ....
>>> 2^44/4500 = 3909374676.53689
>>
>> Is that 2 to the 44th power?
>
> Yes, caret (^) is the C exponentiation operator.
>

It is the exponentiation operator in several languages, but not in C.
In C it is the bitwise exclusive-or operator.

Emlyn


Grumble

Feb 11, 2004, 2:29:43 PM
Tim Roberts wrote:

> Yes, caret (^) is the C exponentiation operator.

Errr, Tim? There is no exponentiation operator in C.

In C, the circumflex accent (ASCII code 0x5e) is the bitwise exclusive
OR operator.

What you probably meant was:

2^n can be computed as 1 << n in C syntax.

Terje Mathisen

Feb 11, 2004, 4:02:46 PM
Grumble wrote:

Only sort of:

50% longer pipeline
New instructions
4 x slower L1 cache
MUL and SHx hw added
Dynamic SMT cache allocation

Probably much larger reorder/renaming pools.

Terje

Terje Mathisen

Feb 11, 2004, 4:02:49 PM
Emlyn Peter Corrin wrote:

In pseudo-code used to illustrate an algorithm, 2^44 is indeed shorthand
for the exponentiation operation C doesn't have.

Terje

Matt Taylor

Feb 11, 2004, 4:38:37 PM
"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:c0e4ai$94e$1...@osl016lin.hda.hydro.com...

> Grumble wrote:
>
> > Matt Taylor wrote:
> >
> >> Pentium-4 takes 15-18 cycles, but that is because it has no integer
> >> multiply unit! Prescott will be much faster in that area.
> >
> >
> > Matt,
> >
> > Be careful, Prescott *is* a Pentium 4
>
> Only sort of:
>
> 50% longer pipeline
> New instructions
> 4 x slower L1 cache
> MUL and SHx hw added
> Dynamic SMT cache allocation
>
> Probably much larger reorder/renaming pools.

He was referring to the marketing name. Prescott is being marketed as
Pentium-4E. For some time (a year?) I have expected Prescott to be marketed
as Pentium-5, so my naming system is a little unconventional.

By the way, are you sure the L1 cache is 4x slower? I thought it was only
2x. The old round-trip latency was 2 cycles, and from what I have read the
new round-trip latency is 4 cycles. (Huh? How can it be slower than Athlon's
3-cycle L1 cache?) That still doesn't make any sense to me. The original
purpose of the small 8KB cache was to keep the latency low.

-Matt

Terje Mathisen

Feb 11, 2004, 6:04:57 PM
Matt Taylor wrote:
> By the way, are you sure the L1 cache is 4x slower? I thought it was only

Now, my only info about this is from comp.arch discussions, where the 4
cycle figure seemed to be the load-to-use latency and not the
back-to-back linked list reference latency.

If even the P4 is 2 cycles, then 4 isn't _that_ much worse, but it
doesn't seem like a good choice.

Robert Wessel

Feb 11, 2004, 6:22:03 PM
Grumble <inv...@kma.eu.org> wrote in message news:<402A0072...@kma.eu.org>...


Well, Intel's marketing department may call it a P4, but it sure
doesn't look much like a Northwood or Willie. And the more we see of
it the stranger it becomes. In any event, they did appear to add a
proper integer multiplier to Prescott.

Tim Roberts

Feb 12, 2004, 1:45:10 AM

Doh! I knew that. Thanks for pointing it out.

I suppose I expose my programming roots by saying I write exponentiation in
pseudo-code as 2**43.

Tim Roberts

Feb 12, 2004, 1:45:14 AM
"Matt Taylor" <pa...@tampabay.rr.com> wrote:

>"Bob Masta" <NoS...@daqarta.com> wrote in message
>>

>> Instead of shifting EDX down by 12 bits after the MUL, why not
>> just shift the original multiplier constant by that amount and
>> skip the SHR completely?
>
>Try it with Terje's example and see:
>
>(4500 * 3909374677) / 2^44 = 17592186046500 / 2^44 = 1
>4500 * (3909374677 / 2^44) = 4500 * 0 = 0

Actually, his suggestion was just to eliminate the final shift by 12 -- you
get the other 32-bit shift for free. However, it turns out the answer is
still wrong:

(4500 * (3909374677 / 2^12)) / 2^32 = (4500 * 954437) / 2^32
                                    = 4294966500 / 2^32 = 0

Grumble

Feb 12, 2004, 1:06:56 PM
[ Message cross-posted to comp.arch ]

Matt Taylor wrote:

> Terje Mathisen wrote:
>
>> Grumble wrote:
>>
>>> Matt Taylor wrote:
>>>
>>>> Pentium-4 takes 15-18 cycles, but that is because it has no integer
>>>> multiply unit! Prescott will be much faster in that area.
>>>
>>> Matt,
>>>
>>> Be careful, Prescott *is* a Pentium 4
>>
>> Only sort of:
>>
>> 50% longer pipeline
>> New instructions
>> 4 x slower L1 cache
>> MUL and SHx hw added
>> Dynamic SMT cache allocation
>>
>> Probably much larger reorder/renaming pools.
>

> He was referring to the marketing name. Prescott is being
> marketed as Pentium-4E.

Indeed. I wrote tongue-in-cheek :-)

> [...] and from what I have read the new round-trip latency is 4 cycles.


> (Huh? How can it be slower than Athlon's 3-cycle L1 cache?) That still
> doesn't make any sense to me. The original purpose of the small 8KB
> cache was to keep the latency low.

Let's see...

K7 L1 data cache:
64 KB
2-way set associative
3-cycle latency @ 2.2 GHz = 1.36 ns

Northwood L1 data cache:
8 KB
4-way set associative
2-cycle latency @ 3.2 GHz = 0.625 ns

Prescott L1 data cache:
16 KB
8-way set associative
4-cycle latency @ 5 (??) GHz = 0.8 ns

In Prescott, size and associativity have been doubled, compared to
Northwood. As far as I understand, both factors make the cache slower
to access. If Intel intends to scale Prescott to 5 GHz, they had to
increase the cache's cycle-latency, and they would still have to use
faster SRAM than AMD (0.8 ns vs 1.36 ns).

How much influence did these changes have on cache latency:

cache size doubled
associativity doubled
process change from 130 nm to 90 nm

Grumble

Feb 12, 2004, 1:07:00 PM
Terje Mathisen wrote:

> 50% longer pipeline
> New instructions
> 4 x slower L1 cache
> MUL and SHx hw added
> Dynamic SMT cache allocation
>
> Probably much larger reorder/renaming pools.

Do you believe there are more than 128 physical registers in Prescott?

Matt Taylor

Feb 12, 2004, 1:40:05 PM
"Grumble" <inv...@kma.eu.org> wrote in message
news:402B5015...@kma.eu.org...

There are 256 now.

-Matt

Terje Mathisen

Feb 12, 2004, 4:11:09 PM
Grumble wrote:

If 128 is the P4 number, then yes, I'd expect this to be increased for a
CPU with a 50% longer pipeline.

Iain McClatchie

Feb 12, 2004, 9:48:31 PM
> Northwood L1 data cache:
> 8 KB
> 4-way set associative
> 2-cycle latency @ 3.2 Ghz = 0.625 ns
>
> Prescott L1 data cache:
> 16 KB
> 8-way set associative
> 4-cycle latency @ 5 (??) Ghz = 0.8 ns

Writing it out in ns is definitely the right way to think about
latency. You need to specify whether you are subtracting a cycle
for issuing the load (this used to be common when we had scalar
processors, superscalar just makes this practice confusing). In
any case, I suspect these numbers are not completely right, even
aside from the fact that you're quoting 5 GHz (10 GHz fireball),
which has not been announced or shipped (but probably is what they
were targeting).

I would like to see some definitive numbers here. When I earlier
stated that Prescott had a 4-cycle latency, I was going off an
article on the web that did not distinguish between the main clock
(currently 3.2 GHz) and the fireball clock (currently 6.4 GHz).
That article showed Northwood with a latency of 1 cycle, and
Prescott with 4.

I don't have access to a P4, Northwood or Prescott. Can someone
who does run a latency test (like McVoy's lmbench) and post the
results, please?

Grumble

Feb 13, 2004, 1:53:45 PM
Matt Taylor wrote:

> Grumble wrote:
>
>> Do you believe there are more than 128 physical registers in Prescott?
>
> There are 256 now.

Where did you read that? :-)

Matt Taylor

Feb 13, 2004, 1:53:53 PM
"Iain McClatchie" <iai...@truecircuits.com> wrote in message
news:45022fc8.04021...@posting.google.com...

> > Northwood L1 data cache:
> > 8 KB
> > 4-way set associative
> > 2-cycle latency @ 3.2 Ghz = 0.625 ns
> >
> > Prescott L1 data cache:
> > 16 KB
> > 8-way set associative
> > 4-cycle latency @ 5 (??) Ghz = 0.8 ns
>
> Writing it out in ns is definitely the right way to think about
> latency. You need to specify whether you are subtracting a cycle
> for issuing the load (this used to be common when we had scalar
> processors, superscalar just makes this practice confusing). In
> any case, I suspect these numbers are not completely right, even
> aside from the fact that you're quoting 5 GHz (10 GHz fireball),
> which has not been announced or shipped (but probably is what they
> were targetting).

The only parts running 10 GHz are the double-clocked ALUs. The L/S hardware
and caches are only running 5 GHz. Intel has not shipped a 5 GHz part, but
that is their intention. Prescott will hit 4 GHz before the end of this
year. I have seen that claimed both in print as well as on an unofficial
roadmap (http://www.rabbithole.cc/roadmap.gif). Tejas is supposedly a
dual-core Prescott.

> I would like to see some definitive numbers here. When I earlier
> stated that Prescott had a 4-cycle latency, I was going off an
> article on the web that did not distinguish between the main clock
> (currently 3.2 GHz) and the fireball clock (currently 6.4 GHz).
> That article showed Northwood with a latency of 1 cycle, and
> Prescott with 4.

Yes, I saw those figures as well. Northwood has a 2-cycle access time, not
1-cycle. Anandtech lists both figures
(http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8), but they go with the
1-cycle figure rather than doing their homework. 2-cycles is what was quoted
on RWT.
(http://www.realworldtech.com/page.cfm?ArticleID=RWT091000000000&p=4)

> I don't have access to a P4, Northwood or Prescott. Can someone
> who does run a latency test (like McVoy's lmbench) and post the
> results, please?

I have access to a Northwood, but it's stuck running Windows, and I don't
see any evidence that lmbench will compile. ScienceMark reports 2 cycles
latency for the L1 cache.

Intel is shipping Prescott, but I haven't seen it for sale anywhere.

-Matt

Iain McClatchie

Feb 13, 2004, 7:39:40 PM
Matt> The only parts running 10 GHz are the double-clocked ALUs. The L/S
Matt> hardware and caches are only running 5 GHz.

Are you absolutely sure about that? Is it true of both Northwood and
Prescott? The implication is that the current 3.2 GHz machines do just
3.2 billion loads per second from L1, whereas a K7 at 2 GHz can do nearly
4 billion (maybe 3.5 billion). I had previously been under the impression
that the P4 had more L1 bandwidth than the K7.

Matt> Yes, I saw those figures as well. Northwood has a 2-cycle access time,
Matt> not 1-cycle. Anandtech lists both figures
Matt> (http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8), but they go
Matt> with the 1-cycle figure rather than doing their homework. 2-cycles is
Matt> what was quoted on RWT.
Matt> (http://www.realworldtech.com/page.cfm?ArticleID=RWT091000000000&p=4)

That makes a lot more sense. And you are speaking of 5 GHz (future) cycles,
so Prescott is just twice the latency of Northwood, rather than 4x which
seemed so ridiculous.

Of course, it does mean Prescott can execute 16 ALU ops (8 serial) in the
time it takes to do 1 load (and issue 4 independent). You can really see
why those ALUs are getting no traction.

It also means the current 3.2GHz Prescott has a 1.25ns primary cache
latency, versus 1.5ns in the K7 at 2GHz. By the end of the year, if they
make their schedule, Intel will be back at 1ns for their primary again, but
will not approach the 625ps they are shipping now for the next two years!

Northwood's REE is amazing.

Matt Taylor

Feb 13, 2004, 11:11:11 PM
"Iain McClatchie" <iai...@truecircuits.com> wrote in message
news:45022fc8.04021...@posting.google.com...
> Matt> The only parts running 10 GHz are the double-clocked ALUs. The L/S
> Matt> hardware and caches are only running 5 GHz.
>
> Are you absolutely sure about that? Is it true of both Northwood and
> Prescott? The implication is that the current 3.2 GHz machines do just
> 3.2 billion loads per second from L1, whereas a K7 at 2 GHz can do nearly
> 4 billion (maybe 3.5 billion). I had previously been under the impression
> that the P4 had more L1 bandwidth than the K7.

Pentium-4 can issue one load and one store per cycle, both up to 16 bytes.
K7 can issue any combination of loads or stores up to two per cycle, both 8
bytes. This information for the Pentium-4 is given in the optimization
manual. In my copy (248966-009), it is page 1-22. I don't see a copy updated
for Prescott yet, but I presume the L/S hardware has not changed aside from
buffer increases.

> Matt> Yes, I saw those figures as well. Northwood has a 2-cycle access
> Matt> time, not 1-cycle. Anandtech lists both figures
> Matt> (http://www.anandtech.com/cpu/showdoc.html?i=1956&p=8), but they go
> Matt> with the 1-cycle figure rather than doing their homework. 2-cycles
> Matt> is what was quoted on RWT.
> Matt> (http://www.realworldtech.com/page.cfm?ArticleID=RWT091000000000&p=4)
>
> That makes a lot more sense. And you are speaking of 5 GHz (future)
> cycles, so Prescott is just twice the latency of Northwood, rather than
> 4x which seemed so ridiculous.

Yep.

> Of course, it does mean Prescott can execute 16 ALU ops (8 serial) in the
> time it takes to do 1 load (and issue 4 independent). You can really see
> why those ALUs are getting no traction.

This is exactly why I was shocked to hear it. Perhaps they will fix it with
the next iteration of chips much as they seem to have done with the
shifter/multiplier in Prescott. (Yay!)

I had heard that Intel initially planned to increase to a 16K u-op cache
with additional execution bandwidth (now slated for Potomac/Nocona??), but
that apparently got cut. Perhaps the larger, slower dcache was part of that
package which was supposed to benefit Hyperthreading.

I have heard rumors of 64-bit extensions in Prescott to compete with
Opteron. If that is true and x86-64 takes off, then the cache latency will
be mitigated.

> It also means the current 3.2GHz Prescott has a 1.25ns primary cache
> latency, versus 1.5ns in the K7 at 2GHz. By the end of the year, if they
> make their schedule, Intel will be back at 1ns for their primary again,
but
> will not approach the 625ps they are shipping now for the next two years!
>
> Northwood's REE is amazing.

I'm not familiar with that acronym; what's the term?

-Matt

Matt Taylor

Feb 14, 2004, 2:04:26 PM
"Grumble" <inv...@kma.eu.org> wrote in message
news:c0i4hm$ch1$1...@news-rocq.inria.fr...

http://www.chip-architect.com/news/2003_03_23_Prescott_doubles_inflight_uOps.html

-Matt

Iain McClatchie

Feb 14, 2004, 7:56:58 PM
Matt> Pentium-4 can issue one load and one store per cycle, both up to 16 bytes.

Meaning???
- Two agen units, one for loads, one for stores, can both calculate
addresses at the same time. Clearly the store will hold off writing
data into the 16KB SRAM until graduation. Seems reasonable, but it
means that the primary cache bandwidth isn't necessarily better than K8
unless you use SSE loads/stores.
- The 16KB SRAM can do a 16B read and a 16B write in the same cycle.
Seems doubtful.
- The 16KB SRAM is banked/replicated, and the load only goes to one
bank/replicate, so that the store data queue eventually writes to
both, but on different cycles. I personally like this idea, but I
doubt anyone's done it yet.

I'm frankly amazed they can do a read in the 300ps after a write --
the RC on the bit lines in precharge must be less than 50ps, including
the driver == ~200um of wire + zero-resistance driver. Hmm.. L1 cache
looks like it's 95um x 155um in the chip-architect.com picture, so if
that size is right I suppose 100um bitlines are possible. Wire pitch
would have to be < 160nm, though, so something is wrong with these numbers.

Matt> I have heard rumors of 64-bit extensions in Prescott to compete with
Matt> Opteron. If that is true and x86-64 takes off, then the cache latency
Matt> will be mitigated.

Huh? Do you mean the L1 Dcache latency won't matter so much because
with 16 registers there will be a lot fewer memory ops?

It's been too long since I've played with an architectural simulator, but
I seem to remember that the "Alien" folks at SGI circa 1996 were
contemplating an OoO machine with a load latency of 3 cycles, and two
ld/st and at least two ALUs per cycle. The MIPS architecture has 32 regs,
and they seemed to think that 3 cycle latency was okay, but 4 cycles led
to more execution stalls than it saved in better hit rate.

So, I don't think the extra 8 registers will mitigate the extra dcache
latency very much.

Iain> Northwood's REE is amazing.

Matt> I'm not familiar with that acronym; what's the term?

Intel marketing, sorry. "Rapid Execution Engine", a.k.a. Fireball --
the thing running at 6.4 GHz right now.

Nudge

Feb 15, 2004, 4:38:34 AM
Matt Taylor wrote:

> Grumble wrote:
>
>> Matt Taylor wrote:
>>
>>> Grumble wrote:
>>>
>>>> Do you believe there are more than 128 physical registers
>>>> in Prescott?
>>>
>>> There are 256 now.
>>
>> Where did you read that? :-)
>
> http://www.chip-architect.com/news/2003_03_23_Prescott_doubles_inflight_uOps.html

Hans de Vries wrote that article almost a year ago. His information
didn't come from Intel; he took a guess based on several pictures of
a Prescott die.

Regards,

Nudge

Christian Bau

Feb 16, 2004, 1:33:51 PM
In article <45022fc8.04021...@posting.google.com>,
iai...@truecircuits.com (Iain McClatchie) wrote:

> Of course, it does mean Prescott can execute 16 ALU ops (8 serial) in the
> time it takes to do 1 load (and issue 4 independent). You can really see
> why those ALUs are getting no traction.

Can they actually do that? I thought there was a limit of 3 microops per
cycle.

Mitch Alsup

Feb 16, 2004, 2:56:32 PM
"Matt Taylor" <pa...@tampabay.rr.com> wrote in message news:<hffXb.112034$fH2....@twister.tampabay.rr.com>...

> "Iain McClatchie" <iai...@truecircuits.com> wrote in message
> news:45022fc8.04021...@posting.google.com...
> > Matt> The only parts running 10 GHz are the double-clocked ALUs. The L/S
> > Matt> hardware and caches are only running 5 GHz.
> >
> > Are you absolutely sure about that? Is it true of both Northwood and
> > Prescott? The implication is that the current 3.2 GHz machines do just
> > 3.2 billion loads per second from L1, whereas a K7 at 2 GHz can do nearly
> > 4 billion (maybe 3.5 billion). I had previously been under the impression
> > that the P4 had more L1 bandwidth than the K7.
>
> Pentium-4 can issue one load and one store per cycle, both up to 16 bytes.
> K7 can issue any combination of loads or stores up to two per cycle, both 8
> bytes. This information for the Pentium-4 is given in the optimization
> manual. In my copy (248966-009), it is page 1-22. I don't see a copy updated
> for Prescott yet, but I presume the L/S hardware has not changed aside from
> buffer increases.

Not quite: K7 can issue 3 memory address computations from the schedulers
every cycle; however, only 2 can be processed in the data cache per cycle,
with excess memory references being held in LS1 until they can obtain a
data cache port.

> -Matt

Mitch

Matt Taylor

Feb 16, 2004, 6:51:23 PM
"Christian Bau" <christ...@cbau.freeserve.co.uk> wrote in message
news:christian.bau-162...@slb-newsm1.svr.pol.co.uk...

His numbers do not take into account the TC bandwidth limitation. Prescott
can issue 12 u-ops in 4 cycles, the latency of an L1 dcache access. If you
look at the raw execution bandwidth, both double-speed ALUs can jointly
execute 16 u-ops in that amount of time. Because they are double-speed, they
can execute 8 back-to-back *dependent* ALU ops. The point is that the
schedulers will not be able to keep the execution units full because the
latency of the L1 cache is too high.

-Matt

Matt Taylor

Feb 17, 2004, 2:12:30 PM
"Iain McClatchie" <iai...@truecircuits.com> wrote in message
news:45022fc8.04021...@posting.google.com...
> Matt> Pentium-4 can issue one load and one store per cycle, both up to 16
bytes.
>
> Meaning???
> - Two agen units, one for loads, one for stores, can both calculate
> addresses at the same time. Clearly the store will hold off writing
> data into the 16KB SRAM until graduation. Seems reasonable, but it
> means that the primary cache bandwidth isn't necessarily better than
K8
> unless you use SSE loads/stores.

That came out of the PDF describing Williamette and Northwood. I checked
today, and the PDF has been updated for Prescott. It still claims one load
and one store per cycle. Some older non-official, non-Intel literature
claims that Prescott now runs the L/S and caches at double-speed, but the
Intel docs don't mention that.

> - The 16KB SRAM can do a 16B read and a 16B write in the same cycle.
> Seems doubtful.

<snip>

I spent several days looking for an answer on this point, and I could not
find any information at all. However, point (d) in figure 2-2 on page 2-32
says that a 16-byte STLF must be 16-byte aligned. In AMD's implementation, I
don't think that matters. A movdqa simply issues a pair of 8-byte loads.
Perhaps Intel does the same and the manual is wrong; I don't know.

> Matt> I have heard rumors of 64-bit extensions in Prescott to compete with
> Matt> Opteron. If that is true and x86-64 takes off, then the cache
> Matt> latency will be mitigated.
>
> Huh? Do you mean the L1 Dcache latency won't matter so much because
> with 16 registers there will be a lot fewer memory ops?

Yes, although that is an overstatement of what I think. On a processor with
7 effective registers, 4 cycles to the L1 cache is an atrocity. Even with
hand-tuned assembly, there is no way to completely hide that latency most of
the time. I find myself frequently lacking only 1 or 2 registers for live
variables. On Prescott, I will pay dearly for that. On "Prescott-64" I more
frequently will be able to keep all live variables in registers and avoid
that penalty. That also gives me more freedom to schedule loads earlier
since there will be less contention for the L/S.

> It's been too long since I've played with an architectural simulator, but
> I seem to remember that the "Alien" folks at SGI circa 1996 were
> contemplating an OoO machine with a load latency of 3 cycles, and two
> ld/st and at least two ALUs per cycle. The MIPS architecture has 32 regs,
> and they seemed to think that 3 cycle latency was okay, but 4 cycles led
> to more execution stalls than it saved in better hit rate.
>
> So, I don't think the extra 8 registers will mitigate the extra dcache
> latency very much.

I have no doubt that Intel would have been better off with the old 8KB
cache, but I also have no doubt that a significant number of memory accesses
will be eliminated in x86-64. For starters, both Linux and Windows will be
using register calling conventions. I was told that the beta of Win64 on
Opteron is 10% faster in 32-bit graphics/games than its pure 32-bit
counterpart. It would seem to be related to the virtues of the x86-64
architecture more than the Opteron itself. If Opteron gains >10% with
3-cycle cache latency, "Prescott-64" will get even more.

I do hope they fix the cache in a future iteration. They fixed the shifter
in Prescott; the multiplier is still slow.

> Iain> Northwood's REE is amazing.
>
> Matt> I'm not familiar with that acronym; what's the term?
>
> Intel marketing, sorry. "Rapid Execution Engine", a.k.a. Fireball --
> the thing running at 6.4 GHz right now.

Ah, right. Specifically the double-clocked path with the ALUs, registers,
and schedulers. There is some funny business going on there. On Prescott all
ALU ops now have a latency of 1 clk. Logical ops have a throughput of 0.5
clk while arithmetic ops have a throughput of 1 clk. It would seem the REE
isn't quite so amazing now. :-(

-Matt

Greg Lindahl

Feb 17, 2004, 3:39:23 PM
In article <y_lYb.219047$fH2....@twister.tampabay.rr.com>,
Matt Taylor <pa...@tampabay.rr.com> wrote:

> In AMD's implementation, I
> don't think that matters. A movdqa simply issues a pair of 8-byte loads.
> Perhaps Intel does the same and the manual is wrong; I don't know.

In AMD's implementation, the load has to be 128-bit aligned.

-- greg

Matt Taylor

Feb 17, 2004, 10:31:54 PM
"Greg Lindahl" <lin...@pbm.com> wrote in message
news:40326fb9$1...@news.meer.net...

Not an MMX load, nor an integer load. I was talking about using movdqa to
store on a 16-byte boundary and then trying to load from the high half. I
ran a little test on my 1.4 GHz Palomino, and it can STLF just fine even
when the subsequent load is not 16-byte aligned.

I repeated the following sequences of code 12 times each and measured the
number of cycles each took to execute:

; Test 1: 16-byte aligned store, 16-byte aligned load
movaps [esi], xmm0
movq mm0, [esi]

; Test 2: 16-byte aligned store, 8-byte aligned load
movaps [esi], xmm0
movq mm0, [esi+8]

; Test 3: 8-byte aligned store, 8-byte aligned load
movq [esi], mm0
mov eax, [esi]

; Test 4: 8-byte aligned store, 4-byte aligned load
movq [esi], mm0
mov eax, [esi+4]

The last 2 tests were mainly to show that my timing framework can detect
STLF stalls. Here are the results:

Test 1: 53 cycles
Test 2: 57 cycles
Test 3: 27 cycles
Test 4: 127 cycles

This is pretty much what I expected. On K7 there is no 16-byte alignment
restriction on a load to STLF an SSE store.

I ran the same code on a Northwood as well:

Test 1: 44 cycles
Test 2: 468 cycles
Test 3: 44 cycles
Test 4: 60 cycles

The code isn't scheduled for a Northwood, but I can't think of a good way to
do that without hiding the STLF stalls. It seems MMX and SSE loads/stores
all take 6 cycles to complete. This code also demonstrates my original claim
that Northwood can load and store 16-bytes in the cache each cycle. I
presume Prescott can as well, but again I can't find any docs that say
anything explicitly about the L/S width.

-Matt

Iain McClatchie

Feb 19, 2004, 4:26:20 AM
I got back from ISSCC today. I'll try to write up a little of what I learned
that's relevant to comp.arch later. But:

After the Prescott presentation (3.4), I asked about the primary cache latency.
Specifically, I pointed out that with the numbers presented (5.5 GHz in two
years), they would be shipping a 700 ps L0 Dcache (that's what they call it)
two years from now, and are currently shipping 625ps. The speaker (C. Webb?)
told me it didn't matter.

I talked with D. Deleganes separately, who claims to have led the effort on
the Prescott FCLK core. Currently, MCLK = 3.2 GHz, FCLK = 6.4 GHz. Mr.
Deleganes gave an enlightening presentation on the circuits used in the FCLK
core -- I assume they're patented out the wazoo. Mr. Deleganes confirmed
that the L0 Dcache latencies don't matter. A friend of his, who had worked
on Willamette and Northwood and Centrino, but who no longer works for Intel,
suggested without arousing counterargument that Mr Deleganes had been assigned
to make those circuits cycle at 8-12 GHz, latency be damned. Mr Deleganes
pointed out several times that the new circuits are much less finicky than
the Willamette and Northwood circuits. He seemed remarkably unfamiliar with
the L0 cache latencies, commenting several times that the latency depended
on where you started counting from. (I said, you pick a spot, walk around a
loop, get back to where you started, and tell me the latency.) I think the
trouble may have come from the fact that the AGEN runs in the FCLK domain,
but can only kick out a load every other cycle.

A (the?) Prescott cache designer was also at this conversation. I asked if
there was some sort of mid-project change that explained the large cache
latency and modest size increase. He pointed out that cache design is a fairly
mature field, you wouldn't expect surprises. So, basically, I was being
stonewalled, but it does sound as if they deliberately chose GHz over latency.
I get the feeling the guys I was talking to weren't very familiar with the
architectural implications.

The Prescott cache does one load and one store every MCLK. It wasn't
clear if this is done off FCLK, or just alternate phases of MCLK, but that
doesn't much matter from a code execution point of view. Mr. Deleganes did
say that loads were aligned, stores were unaligned, but perhaps he meant the
reverse (now I'm not so sure I can remember which way he said was which).

The following is what I've assembled from several conversations:

Willamette and Northwood used self-resetting Domino logic. The Northwood
core was shrunk from Willamette; apparently this caused a great deal of pain.
The issue is that self-resetting domino gates generate pulses of small width.
When these pulses arrive at another self-resetting domino gate, the output
pulse is generally shorter than the input pulses, and gets shorter still if
the input pulses do not arrive simultaneously. It took a lot of work to tune
these circuits to work. Apparently probing the silicon to help debug runtime
failures is NOT considered feasible -- silicon debug is limited to screwing
with the clock to try to find the failing path, and paths that don't work at
any speed (and give zero debug information) are likely.

From the presentation:
Prescott uses low swing logic. They have true and complement inverters,
driving true and complement nmos-pass-gate structures (up to 6 xtrs in series),
going into a sense amp which drives a domino gate. That appears to be the
entire FCLK cycle. The inputs to the sense amps typically swing tens of
millivolts. All interior nodes to the pass gate structure are reset low, and
a cutoff transistor after the initial inverter prevents it from sneaking any
charge in there. The pass gate nmos gates are driven full-rail. The load
alignment mux is apparently one big merged nmos pass gate structure with over
5000 transistors. This sounds amazing, but I've got designs from 1997 with
almost exactly the same thing, and it's not a big deal... shifters just end up
like that.

The big advantage of this logic is that it develops more signal over time, so
a given chip will usually run at some speed, rather than just failing outright
like self-resetting domino. This should help them port to multiple voltages,
deliberate process variations (mobile parts, etc).

In paper 24.6, a few guys from Intel showed how they were able to build a
100 GHz VCO (voltage controlled oscillator, the thing in the PLL that goes
tick-tock) on their 90 nm process. This corresponds to a free-space wavelength
of 300 microns -- in copper through silicon dioxide I'm sure it's more like
40-50 microns. It's just a curiosity, but... yow.

Considering the size of the FCLK region, that means that Prescott is running
at ten percent or more of, literally, the speed of light. Mark Horowitz gave
a talk saying that we've pretty convincingly smacked into the wall on speed
and power, and how life will be interesting anyway. My prediction for next
year's keynote:

Photons Won't Help

Dr John Stockton

Feb 19, 2004, 7:02:17 PM
JRS: In article <45022fc8.04021...@posting.google.com>, seen
in news:comp.lang.asm.x86, Iain McClatchie <iai...@truecircuits.com>
posted at Thu, 19 Feb 2004 09:26:20 :-

>
>In paper 24.6, a few guys from Intel showed how they were able to build a
>100 GHz VCO (voltage controlled oscillator, the thing in the PLL that goes
>tick-tock) on their 90 nm process. This corresponds to a free-space wavelength
>of 300 microns --

300 um * 100 GHz = 3E7 m/s. But c = 3E8 m/s. ???

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk / ??.Stoc...@physics.org ©
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Correct <= 4-line sig. separator as above, a line precisely "-- " (SoRFC1036)
Do not Mail News to me. Before a reply, quote with ">" or "> " (SoRFC1036)

Terje Mathisen

Feb 20, 2004, 2:46:47 AM
Iain McClatchie wrote:

> I got back from ISSCC today. I'll try to write up a little of what I learned
> that's relevant to comp.arch later. But:
>
> After the Prescott presentation (3.4), I asked about the primary cache latency.
> Specifically, I pointed out that with the numbers presented (5.5 GHz in two
> years), they would be shipping a 700 ps L0 Dcache (that's what they call it)
> two years from now, and are currently shipping 625ps. The speaker (C. Webb?)
> told me it didn't matter.
>
> I talked with D. Deleganes separately, who claims to have led the effort on
> the Prescott FCLK core. Currently, MCLK = 3.2 GHz, FCLK = 6.4 GHz. Mr.
> Deleganes gave an enlightening presentation on the circuits used in the FCLK
> core -- I assume they're patented out the wazoo. Mr. Deleganes confirmed
> that the L0 Dcache latencies don't matter. A friend of his, who had worked
> on Willamette and Northwood and Centrino, but who no longer works for Intel,
> suggested without arousing counterargument that Mr Deleganes had been assigned
> to make those circuits cycle at 8-12 GHz, latency be damned. Mr Deleganes
> pointed out several times that the new circuits are much less finicky than
> the Willamette and Northwood circuits. He seemed remarkably unfamiliar with
> the L0 cache latencies, commenting several times that the latency depended
> on where you started counting from. (I said, you pick a spot, walk around a
> loop, get back to where you started, and tell me the latency.) I think the
> trouble may have come from the fact that the AGEN runs in the FCLK domain,
> but can only kick out a load every other cycle.

OK, getting more headroom is a valid design criterion, but history has
shown many times that you'd better not leave too much performance on the
floor, even for the first couple of revisions.


>
> A (the?) Prescott cache designer was also at this conversation. I asked if
> there was some sort of mid-project change that explained the large cache
> latency and modest size increase. He pointed out that cache design is a fairly
> mature field, you wouldn't expect surprises. So, basically, I was being
> stonewalled, but it does sound as if they deliberately chose GHz over latency.
> I get the feeling the guys I was talking to weren't very familiar with the
> architectural implications.

...

Interesting stuff, thanks for the report!

Iain McClatchie

Feb 20, 2004, 1:37:59 PM
John> 300 um * 100 GHz = 3E7 m/s. But c = 3E8 m/s. ???

Yep, off by 1 decimal place. Sorry about that. I posted a correction
yesterday; google seems to have dumped it. This is the second time that's
happened to me -- have other people noticed Google dropping posts?

0 new messages