First view of Willamette

Thomas Womack

Feb 16, 2000, 3:00:00 AM

On ftp://download.intel.com/design/processor/WmtSDG.pdf , you will find a
software developer's guide for Willamette.

The 1.5GHz clock speed is presumably mostly due to the 20-stage pipeline,
and is I suppose the logical way to go if processors sell themselves on
their MHz figure; the double-pumped integer ALU seems an interesting
approach, though I suppose it's comparable to the 3GHz figures IBM were
throwing about last week. Keeping decoded micro-ops in the instruction cache
(called Execution Trace Cache) is a clever idea. With a pipeline that long,
the branch prediction had better be good - and the first priority in the
code optimisation guidelines is 'improve branch predictability'. At least
indirect branches are now predicted, and branch hints have been added.

An interesting point is that shifts and integer multiplies are no longer
cheap (2 or 4 ticks for a shift, 10 for an IMUL, though I don't know whether
those are double-pumped ticks). FXCH is a hindrance, so I guess Intel are
assuming that P5-scheduled FP code will be dead by the time Willamette
appears.

I think that, on the third iteration, Intel have finally got SIMD
instructions right: you can now treat an XMM register as a pair of doubles,
a pair of 64-bit integers (signed or unsigned), four floats, four 32-bit
ints, eight words or 16 bytes. You still have to do transcendental functions
yourself, but the ideas from Intel's paper about implementing transcendental
functions on iTanium work just as well on x86. I wish they'd think of a term
other than 'double quadword' for a 128-bit quantity, though!

There's a 32x32 -> 64 unsigned multiply, and a command for doing two of
these at once; coupled with a shuffle-dwords command, this means you can at
last use SIMD instructions for multi-precision arithmetic. The claim is that
this is three times faster at a given clock speed than you can manage on a
Coppermine. It looks as if Intel would rather that you do integer work in
the integer and MMX registers, and use their SIMD instructions for FP; the
latency of SIMD is better than x87. At last - a sensible set of FP
instructions on an Intel machine, though eight registers isn't much for
unrolling.

Dr Harley will laugh at the absence of 64x64->128 multiply; I suspect the
EV68 will be noticeably faster than Willamette for MP arithmetic and for
non-vectorised FP - though equally I suspect Intel wants people keen on
those operations to buy McKinleys instead. The EV68s may still be faster
than the McKinleys, of course :).

On the whole, this is an interesting-looking chip; it'll be a year before
I'd be able to afford a Willamette system, but the release of the
documentation now has had the desired effect of making me no longer lust
after Athlons as the ultimate in performance. I'll buy the Athlon anyway,
knowing that there'll be a factor-of-two upgrade available in mid-2001 (with
'upgrade' as in 'replace motherboard, processor, power supply, cooling,
memory and case').

Tom

Carlo Razzeto

Feb 16, 2000, 3:00:00 AM

Sounds interesting... I'd love to see the papers on the Thunderbird Athlon;
from what I hear they are making some changes, most notably on-die full-speed
cache and lots of it... I hope there are some other nice surprises...

Carlo Razzeto
"Thomas Womack" <t...@tom-womack.fsnet.co.uk> wrote in message
news:88f73t$36b$4...@news6.svr.pol.co.uk...

Rui Pedro Mendes Salgueiro

Feb 17, 2000, 3:00:00 AM

In comp.arch George Herbert <gher...@crl3.crl.com> wrote:
> A little birdie is going around reporting that OS vendors
> are being told that the Willamette will require kernel
> mods to operating systems to "shut down unused parts of
> the CPU to keep it from melting".

That can't be. The first unattended OS crash would destroy the
processor.

--
http://www.mat.uc.pt/~rps/f1/ an half-tifoso until Canada 2000
Mark Sandman - Morphine, RIP (1952-1999/07/03, Italy)
.pt is Portugal| `Whom the gods love die young'-Menander (342-292 BC)
Europe | Villeneuve 50-82, Toivonen 56-86, Senna 60-94

Jan Vorbrueggen

Feb 17, 2000, 3:00:00 AM

"Thomas Womack" <t...@tom-womack.fsnet.co.uk> writes:

> I wish they'd think of a term other than 'double quadword' for a
> 128-bit quantity, though!

The obvious choice is `octaword', for which there is precedent from some
twenty years ago - there's a VAX `MOVO' instruction.

Jan

Robert Harley

Feb 17, 2000, 3:00:00 AM

"Thomas Womack" <t...@tom-womack.fsnet.co.uk> writes:
> [...Willamette...]
>
> An interesting point is that shifts and integer multiplies are no longer
> cheap (2 or 4 ticks for a shift, 10 for an IMUL, though I don't know [...]

Ouch!

> There's a 32x32 -> 64 unsigned multiply, and a command for doing two of
> these at once; coupled with a shuffle-dwords command, this means you can at
> last use SIMD instructions for multi-precision arithmetic. [...]

> Dr Harley will laugh at the absence of 64x64->128 multiply;

No, not at all. Of course it would be great if there was one, but
even as it is it looks rather nice. I just hope compiler support will
be forthcoming, and not just in the Intel Reference Compiler.
Compiler support for Altivec is excellent and is a model that ought to
be copied, IMO!

Bye,
Rob.

chris ulrich

Feb 17, 2000, 3:00:00 AM

In article <88flep$5...@spool.cs.wisc.edu>, Andy Glew <gl...@cs.wisc.edu> wrote:
[snip]
>a) to reduce overall power consumption even in desktop mode
> - PCs consume a significant fraction of all US electricity

This is silly. Incandescent light bulbs draw as much power as an
average PC. Monitors draw more. I'm strange in that I have lots of
computers in my apartment and mostly fluorescent lights, but I'm
certain that even I use more energy providing illumination for my
apartment than I draw with my computers.
chris

Terje Mathisen

Feb 17, 2000, 3:00:00 AM

Thomas Womack wrote:
>
> On ftp://download.intel.com/design/processor/WmtSDG.pdf , you will find a
> software developer's guide for Willamette.
>
> The 1.5GHz clock speed is presumably mostly due to the 20-stage pipeline,

The P6 pipeline is of comparable length.

> and is I suppose the logical way to go if processors sell themselves on
> their MHz figure; the double-pumped integer ALU seems an interesting
> approach, though I suppose it's comparable to the 3GHz figures IBM were

I wonder why they don't claim 3 GHz for Willamette, I bet we'll see this
number somewhere by the time it is released.

> throwing about last week. Keeping decoded micro-ops in the instruction cache
> (called Execution Trace Cache) is a clever idea. With a pipeline that long,

The Pentium chip did something similar: only pre-decoded instructions could
be issued in parallel, so the first invocation of any code would always run
like a 486.

> the branch prediction had better be good - and the first priority in the
> code optimisation guidelines is 'improve branch predictability'. At least
> indirect branches are now predicted, and branch hints have been added.

I wonder how large the new branch prediction unit is, since Intel state
that it is

"effectively combining all currently available prediction schemes"

> An interesting point is that shifts and integer multiplies are no longer
> cheap (2 or 4 ticks for a shift, 10 for an IMUL, though I don't know whether
> those are double-pumped ticks).

I'd bet (hope?) that's so, there's no good reason to suddenly reduce the
speed of MUL to less than half of what current PIII's deliver, since
this would mean that the new multiplier is slower in real time (not
clocks) than the current chips.
...
The docs call these ticks mClk, and state that MUL is only slightly
slower than on a PIII, so it must be the internal clocks.

The 2-4 cycle shifter is strange though, since they have very fast
barrel shifters in the MMX and SSE parts of the chip, just not in the
regular integer ALU.

When implementing reciprocal mul for integer division, it might be
faster to transfer an integer register to the low part of an SSE reg, use
the PMULUDQ opcode to do the 32x32->64 multiplication, and then shift
the result down while still in the SSE reg, before transferring back?
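For readers who have not met the trick, here is a minimal sketch of division by a constant via reciprocal multiplication, using only a 32x32->64 unsigned product followed by shifts. It is written in Python so the arithmetic is easy to check. The particular multiply-and-shift sequence is the well-known Granlund-Montgomery style one, chosen here as an assumption since the post does not say which variant is meant, and the names `mul32x32`, `make_reciprocal` and `div_by_reciprocal` are purely illustrative. It says nothing about whether the SSE route would actually be faster on Willamette.

```python
MASK32 = 0xFFFFFFFF

def mul32x32(a, b):
    """Model of an unsigned 32x32 -> 64 multiply (one PMULUDQ lane)."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    return a * b

def make_reciprocal(d):
    """Precompute a 32-bit magic multiplier and two shift counts for divisor d."""
    assert 1 <= d <= MASK32
    l = (d - 1).bit_length()                      # ceil(log2(d))
    m = ((1 << 32) * ((1 << l) - d)) // d + 1     # always fits in 32 bits
    return m, min(l, 1), max(l - 1, 0)

def div_by_reciprocal(n, recip):
    """Compute n // d for a 32-bit unsigned n using one multiply and shifts."""
    m, sh1, sh2 = recip
    t = mul32x32(m, n) >> 32                      # high half of the product
    return (t + ((n - t) >> sh1)) >> sh2

if __name__ == "__main__":
    r = make_reciprocal(10)
    print(div_by_reciprocal(1234567, r))
```

The precomputation is done once per divisor; each division then costs one multiply, one subtract, one add and two shifts, which is the part that could in principle stay inside a single register file.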

> FXCH is a hindrance, so I guess Intel are
> assuming that P5-scheduled FP code will be dead by the time Willamette
> appears.

Not at all, they are just trying to further hint that this is the way to
go for compiler vendors.

> I think that, on the third iteration, Intel have finally got SIMD
> instructions right: you can now treat an XMM register as a pair of doubles,
> a pair of 64-bit integers (signed or unsigned), four floats, four 32-bit
> ints, eight words or 16 bytes. You still have to do transcendental functions
> yourself, but the ideas from Intel's paper about implementing transcendental
> functions on iTanium work just as well on x86. I wish they'd think of a term
> other than 'double quadword' for a 128-bit quantity, though!

I love the way they present the horrible mess of alternate register sets
and encodings as "now supports operation on 128-bit XMM registers in the
Willamette processor architecture. These enhanced integer SIMD
instructions allow software developers to have maximum flexibility to
implement algorithms by writing SIMD code with either XMM registers or
MMX registers."

I.e. the part about using 'either XMM or MMX', when in reality they are
forcing sw developers to write all their asm code at least two or three
times, to maintain any sort of backwards compatibility. :-(

BTW, what's wrong with 'octaword'?

> There's a 32x32 -> 64 unsigned multiply, and a command for doing two of
> these at once; coupled with a shuffle-dwords command, this means you can at
> last use SIMD instructions for multi-precision arithmetic. The claim is that
> this is three times faster at a given clock speed than you can manage on a
> Coppermine. It looks as if Intel would rather that you do integer work in
> the integer and MMX registers, and use their SIMD instructions for FP; the
> latency of SIMD is better than x87. At last - a sensible set of FP
> instructions on an Intel machine, though eight registers isn't much for
> unrolling.
>
> Dr Harley will laugh at the absence of 64x64->128 multiply; I suspect the
> EV68 will be noticeably faster than Willamette for MP arithmetic and for
> non-vectorised FP - though equally I suspect Intel wants people keen on
> those operations to buy McKinleys instead. The EV68s may still be faster
> than the McKinleys, of course :).
>
> On the whole, this is an interesting-looking chip; it'll be a year before
> I'd be able to afford a Willamette system, but the release of the
> documentation now has had the desired effect of making me no longer lust
> after Athlons as the ultimate in performance. I'll buy the Athlon anyway,
> knowing that there'll be a factor-of-two upgrade available in mid-2001 (with
> 'upgrade' as in 'replace motherboard, processor, power supply, cooling,
> memory and case').
>
> Tom

Here's a bug in the docs: "addressing is handled through the six
general-purpose registers"

Even gcc is capable of omitting function frame pointers, which allows
seven "general-purpose registers".

The need to dedicate (E)BP for this was removed on the 386, back in
1986.

--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Feb 17, 2000, 3:00:00 AM

Robert Harley wrote:
>
> "Thomas Womack" <t...@tom-womack.fsnet.co.uk> writes:
> > [...Willamette...]
> >
> > An interesting point is that shifts and integer multiplies are no longer
> > cheap (2 or 4 ticks for a shift, 10 for an IMUL, though I don't know [...]
>
> Ouch!

As I noted, this is really 1-2 and 5 clocks, since the numbers refer to
the double-pumped integer ALU.

When comparing to SSE operations we must divide by two.

> > There's a 32x32 -> 64 unsigned multiply, and a command for doing two of
> > these at once; coupled with a shuffle-dwords command, this means you can at
> > last use SIMD instructions for multi-precision arithmetic. [...]

> > Dr Harley will laugh at the absence of 64x64->128 multiply;
>
> No, not at all. Of course it would be great if there was one, but
> even as it is it looks rather nice. I just hope compiler support will

A full 64x64->128 should require two PMULUDQ operations, plus 2 PADDQ
operations.

It seems like you would need a bunch of shuffle/move operations as well,
and the carry propagation would further complicate things:

-) Shuffle one of the inputs (Aa) into AA and aa.

-) Calculate (AB,Ab),(aB,ab)

-) Add the middle terms together (temp = Ab + aB) and do a compare to
get a carry mask (0 or -1)
Shuffle the AB and ab products, to merge the two middle parts.

-) Add the merged number to the temp sum, do another compare to check
for carry
Propagate the first carry into the top half of the AB product

-) Shuffle the temp sum back into the top and bottom halves.

-) Do the final carry propagation.

There's probably a bunch of other ways to do the same, the 'right way'
depends upon the relative speeds and availability of the different
execution units.
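As one of those "other ways", here is a rough sketch of a 64x64->128 multiply built from four 32x32->64 partial products, with carries detected by comparison rather than a carry flag, as the steps above suggest. It is plain Python that models 64-bit wraparound with masks. It does not follow the exact shuffle order described, it uses no real SIMD, and the helper names (`mul32x32`, `mul64x64_128`) are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def mul32x32(a, b):
    """Model of an unsigned 32x32 -> 64 multiply (one PMULUDQ lane)."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    return a * b

def mul64x64_128(x, y):
    """Return (hi, lo) 64-bit halves of the 128-bit product x * y."""
    A, a = x >> 32, x & MASK32          # split x into high and low halves
    B, b = y >> 32, y & MASK32          # split y into high and low halves
    AB, Ab = mul32x32(A, B), mul32x32(A, b)
    aB, ab = mul32x32(a, B), mul32x32(a, b)

    # Add the middle terms in 64-bit arithmetic; a wrapped sum is smaller
    # than either input, which is the compare-for-carry trick.
    mid = (Ab + aB) & MASK64
    carry_mid = 1 if mid < Ab else 0    # this carry is worth 2**96 overall

    # Low 64 bits: ab plus the low half of the middle sum, shifted up.
    lo = (ab + ((mid & MASK32) << 32)) & MASK64
    carry_lo = 1 if lo < ab else 0

    # High 64 bits: AB plus the high half of the middle sum plus both carries.
    hi = (AB + (mid >> 32) + (carry_mid << 32) + carry_lo) & MASK64
    return hi, lo

if __name__ == "__main__":
    hi, lo = mul64x64_128(MASK64, MASK64)
    print(hex(hi), hex(lo))
```

On real hardware the two compares would produce 0 or -1 masks in a register rather than Python integers, so the carry additions become subtractions of the mask; the bookkeeping is otherwise the same.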

> be forthcoming, and not just in the Intel Reference Compiler.
> Compiler support for Altivec is excellent and is a model that ought to
> be copied, IMO!

Indeed.

This is getting to the point where I need a large piece of paper to keep
all the different intermediate results clear. :-(

Terje

Eric C. Fromm

Feb 17, 2000, 3:00:00 AM

Jan Vorbrueggen wrote:
>
> "Thomas Womack" <t...@tom-womack.fsnet.co.uk> writes:
>
> > I wish they'd think of a term other than 'double quadword' for a
> > 128-bit quantity, though!
>
> The obvious choice is `octaword', for which there is precedent from some
> twenty years ago - there's a VAX `MOVO' instruction.
>
> Jan

It seems reasonable that (as an industry) we should finally be able to
settle on some standard terminology. At least we all seem to agree on what
a byte is now (thank God). I would suggest a terminology derived from an
existing broadly accepted data format standard. Oh, say...IEEE-FP for
instance? In that case, 32 bits contain a single precision FP value. Let's
call that a Word. The double precision values are represented in 64 bits; a
Double Word datum.

All in favor say "Aye"!

-eric

--
Eric C. Fromm efr...@sgi.com
Principal Engineer Advanced Systems Division
SGI - Silicon Graphics, Inc. Chippewa Falls, Wi.

Doug Siebert

Feb 17, 2000, 3:00:00 AM

Terje Mathisen <Terje.M...@hda.hydro.com> writes:

>> and is I suppose the logical way to go if processors sell themselves on
>> their MHz figure; the double-pumped integer ALU seems an interesting
>> approach, though I suppose it's comparable to the 3GHz figures IBM were

>I wonder why they don't claim 3 GHz for Willamette, I bet we'll see this
>number somewhere by the time it is released.

One Pentium IV 3000/DX2 coming up! :)

--
Douglas Siebert Director of Computing Facilities
douglas...@uiowa.edu Division of Mathematical Sciences, U of Iowa

I'm planning on being dead for most of the new millennium, how about you?

Peter Seebach

Feb 17, 2000, 3:00:00 AM

In article <88gf36$pfj$1...@pravda.ucr.edu>, chris ulrich <cdu@jawa.> wrote:
> This is silly. Incandescent light bulbs draw as much power as an
>average PC. Monitors draw more. I'm strange in that I have lots of
>computers in my apartment and mostly fluorescent lights, but I'm
>certain that even I use more energy providing illumination for my
>apartment than I draw with my computers.

I can support this one; I generally have a handful of computers running. Long
ago, I had an apartment with maybe 3 "active" computers on pretty much all the
time, and we didn't even shut the monitors off. (Ugh.) We had pretty high
power bills. When the roommate who left the lights on all the time left, our
electric bill dropped by about 30% - which is pretty much the gap between
"normal" power billing and "high" power billing, in that case.

I'm betting that the microwave and coffee maker chew as much power as any of
my computers these days.

-s
--
Copyright 2000, All rights reserved. Peter Seebach / se...@plethora.net
C/Unix wizard, Pro-commerce radical, Spam fighter. Boycott Spamazon!
Consulting & Computers: http://www.plethora.net/
Get paid to surf! No spam. http://www.alladvantage.com/go.asp?refid=GZX636

Maynard Handley

Feb 17, 2000, 3:00:00 AM

In article <88flep$5...@spool.cs.wisc.edu>, "Andy Glew" <gl...@cs.wisc.edu> wrote:

>> A little birdie is going around reporting that OS vendors
>> are being told that the Willamette will require kernel
>> mods to operating systems to "shut down unused parts of
>> the CPU to keep it from melting".
>>
>> I found this a bit shocking, personally, having hardware
>> which requires OS tending to prevent thermal meltdown is
>> generally considered a bad thing.
>
>Unless things have changed greatly at Intel since I was there,
>this is wrong. Nobody would be so stupid as to require OS
>changes for basic functionality.
>
>However, Intel may very well be working with OS developers
>to shut down unused parts of the chip

>a) to reduce overall power consumption even in desktop mode
> - PCs consume a significant fraction of all US electricity

So I've heard in various places. I have to admit I find this claim absurd,
compared to heating, cooling and industry.
Can you give figures along with how they were derived (and hence how
reliable they are)?

Maynard

Maynard Handley

Feb 17, 2000, 3:00:00 AM

In article <y4bt5g4...@mailhost.neuroinformatik.ruhr-uni-bochum.de>,
Jan Vorbrueggen <j...@mailhost.neuroinformatik.ruhr-uni-bochum.de> wrote:

>"Thomas Womack" <t...@tom-womack.fsnet.co.uk> writes:
>
>> I wish they'd think of a term other than 'double quadword' for a
>> 128-bit quantity, though!
>
>The obvious choice is `octaword', for which there is precedent from some
>twenty years ago - there's a VAX `MOVO' instruction.
>
> Jan

Why not just ditch the whole stupid scheme and name things the way any
sane person does internally in their C--- int16, int32, int64 etc, and
likewise packed8-128, packed 16-128 and so on?

Maynard

Jeffrey S. Dutky

Feb 17, 2000, 3:00:00 AM

"Eric C. Fromm" wrote:

>
> Jan Vorbrueggen wrote:
> >
> > "Thomas Womack" <t...@tom-womack.fsnet.co.uk> writes:
> >
> > > I wish they'd think of a term other than 'double quadword' for a
> > > 128-bit quantity, though!
> >
> > The obvious choice is `octaword', for which there is precedent from some
> > twenty years ago - there's a VAX `MOVO' instruction.
> >
> > Jan
>
> It seems reasonable that (as an industry) we should finally
> be able to settle on some standard terminology. At least we
> all seem to agree on what a byte is now (thank God). I would
> suggest a terminology derived from an existing broadly
> accepted data format standard. Oh, say...IEEE-FP for
> instance? In that case, 32 bits contain a single precision
> FP value. Let's call that a Word. The double precision values
> are represented in 64 bits; a Double Word datum.
>
> All in favor say "Aye"!

how about:

byte      =  1 byte  (8-bits)
half-word =  2 bytes (16-bits)
word      =  4 bytes (32-bits)
long-word =  8 bytes (64-bits)
deca-word = 10 bytes (80-bits)
quad-word = 16 bytes (128-bits)
octa-word = 32 bytes (256-bits)

- Jeff Dutky

Corley Brigman

Feb 17, 2000, 3:00:00 AM

In article <88gf36$pfj$1...@pravda.ucr.edu>, chris ulrich <cdu@jawa.>
wrote:
> This is silly. Incandescent light bulbs draw as much power as an
>average PC. Monitors draw more. I'm strange in that I have lots of
>computers in my apartment and mostly fluorescent lights, but I'm
>certain that even I use more energy providing illumination for my
>apartment than I draw with my computers.

possible. but no one said "computers use the most power
of anything today", but "computers use a significant
fraction of all power today". Even 10% is a "significant fraction".
Most PCs have at least a 150-175 watt power supply, that's
2.5-3 average 60 watt light bulbs...and you usually turn
your lights off during the day :) even many companies
i bet turn their lights off at night, but leave the computers
running....

If you have 600 watts worth of light bulbs in your house,
and 200 watts of computers, and the computers are on 24x7
and the lights are on an average of 8 hours a day,
they are using the same total amount of electricity. these
are possibly totally bogus #s, but 600 watts is 10 60-watt
light bulbs, do YOU usually leave on that much light on
in your apartment at once for that length of time? Or
2 300-watt halogens, or about 15 40-watt fluorescent bulbs.
And that's if you think that lighting IS a significant
percentage...if you don't that's another story (i don't
have any statistics handy so i can't prove it) but if you
do, then this says that in this case they would be the
same i.e. the computer/s would be significant too. How much
less does it have to be before it becomes "insignificant"?
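The arithmetic in the example above is easy to check. The sketch below only reproduces the wattages and hours quoted in this post; the helper name `daily_kwh` is made up for illustration, and the figures themselves are, as the post says, possibly bogus.

```python
def daily_kwh(watts, hours_per_day):
    """Energy used per day, in kilowatt-hours, for a constant load."""
    return watts * hours_per_day / 1000.0

lights_kwh = daily_kwh(600, 8)       # 600 W of lighting, on 8 hours a day
computers_kwh = daily_kwh(200, 24)   # 200 W of computers, on around the clock

# How many bulbs of each kind add up to 600 W
bulbs_60w = 600 / 60
halogens_300w = 600 / 300
fluorescents_40w = 600 / 40

if __name__ == "__main__":
    print(lights_kwh, computers_kwh)
    print(bulbs_60w, halogens_300w, fluorescents_40w)
```

Both loads come out at the same daily energy, which is the point of the comparison: a small load left on all day matches a load three times larger that runs a third of the time.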

heating/cooling is probably the biggest cost (unless you're
in california...many people i knew in san jose didn't even
have AC, didn't need it :) but i wouldn't say that
computers are an insignificant percentage....

corley brigman
intel corp.
corley....@intel.com

NOT speaking for intel. just my personal opinions :)

Terje Mathisen

Feb 17, 2000, 3:00:00 AM

I think this must be based on data from an area like Florida or Dallas,
where most of the electricity bill is for running the air conditioner.

In that case, if you use 300W extra round the clock on your PC+monitor,
the AC will have to use another 400-500W (or more? what is the efficiency
of an AC?) to remove the extra heat.

Del Cecchi

Feb 17, 2000, 3:00:00 AM

In article <38AC4C5B...@bellatlantic.net>,

There is an IEEE standard that defines this. 1596.(something)
--

Del Cecchi
cecchi@rchland

Thomas Womack

Feb 17, 2000, 3:00:00 AM

"Maynard Handley" <hand...@ricochet.net> wrote

> Why not just ditch the whole stupid scheme and name things the way any
> sane person does internally in their C--- int16, int32, int64 etc, and
> likewise packed8-128, packed 16-128 and so on?

To some extent, I'd rather call them JKLMN (J is a byte and the next ones go
up in powers of two, with the letters chosen so you don't hit B, W, Q, D, F
or G to avoid confusion), at least when doing the opcode encoding; the
problem with calling a 128bit object a DQ is the purely aesthetic one that
it makes some opcodes longer than others.

'O' is the obvious way to go, of course.

Are there now enough SIMD instruction sets around to start talking about
adding pack<byte,16> or pack<float,4> to the C++ standards? Or do all the
instruction sets work in subtly different ways, so such a suggestion would
either give the x86 a huge advantage or require every architecture to do
some things much more slowly than the native way?

Tom

Ian Stirling

Feb 17, 2000, 3:00:00 AM

Rui Pedro Mendes Salgueiro <r...@rena.mat.uc.pt> wrote:

>In comp.arch George Herbert <gher...@crl3.crl.com> wrote:
>> A little birdie is going around reporting that OS vendors
>> are being told that the Willamette will require kernel
>> mods to operating systems to "shut down unused parts of
>> the CPU to keep it from melting".

>That can't be. The first unattended OS crash would destroy the
>processor.

And rumours that Microsoft will get a commission from Intel for each
replacement processor sold are totally unfounded :)

--
http://inquisitor.i.am/ | mailto:inqui...@i.am | Ian Stirling.
---------------------------+-------------------------+--------------------------
"Melchett : Unhappily Blackadder, the Lord High Executioner is dead
Blackadder : Oh woe! Murdered of course.
Melchett : No, oddly enough no. They usually are but this one just got
careless one night and signed his name on the wrong dotted line.
They came for him while he slept." - Blackadder II

Aaron Spink

Feb 17, 2000, 3:00:00 AM

"Eric C. Fromm" <efr...@sgi.com> writes:

> It seems reasonable that (as an industry) we should finally be able to
> settle on some standard terminology. At least we all seem to agree on what
> a byte is now (thank God). I would suggest a terminology derived from an
> existing broadly accepted data format standard. Oh, say...IEEE-FP for
> instance? In that case, 32 bits contain a single precision FP value. Let's
> call that a Word. The double precision values are represented in 64 bits; a
> Double Word datum.
>
> All in favor say "Aye"!
>

So sorry, everyone knows that a word is 16bits or 2 bytes. 8 bytes
is a quadword, 16 bytes is an octaword. 16bits or 2 bytes being a
word is pretty standard.

The standard will come about on the same day that everyone uses either
British or metric units, but not both.

Aaron Spink
not speaking for Compaq

Andy Glew

Feb 17, 2000, 3:00:00 AM

I can't find the alarmist "computer equipment are 20% of US
electricity consumption" figures I have seen,
but http://www.mge.com/pdfs/home/app_cost.pdf
provides some numbers for household consumption.

I'll go down their list. Remember, I'm talking about electricity
consumption, not total energy consumption. And, at least in
the Midwest, these are not miscible: few people have coal
furnaces at home any more.

Most of the big energy consumers are powered by gas
in my area, so I won't include them in the electricity total:
- water heater
- clothes dryer
- gas range

Some equipment I don't have
- dehumidifier
- waterbed heater
- freezer
- dishwasher
- TV

Of the equipment I have, reading their dollar amounts,
assuming 7 cents / kWh
+ fridge 10.57$
+ lighting 6.15
+ clothes washer 1.12

I have 4 computers at home, not counting my laptops. Of
these, two are on 24 hours a day - the others are powered off
because they are too bloody noisy. Unfortunately, I have had to
disable Energy Star power saving modes for these machines
on the advice of Walter Mossman's computer column ---
certainly, I get a hell of a lot fewer flakey bugs when power saving
is disabled. LINUX, of course, can't handle power saving modes
at all gracefully...

Taking the numbers this page provides for TV, multiplying by 4,
yields 10.08$ as the electricity cost per computer - a round 20$
for the two of them.
Even taking the more conservative estimates of 150 watts
per machine later in the pamphlet, I get roughly 15$.
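As a sanity check on the 150-watt estimate, here is a small sketch of the monthly cost. It assumes the machines run 24 hours a day, a 30-day month, and a rate of about 7 cents per kWh; that rate is an assumption, picked because it is the one at which the "roughly 15$" figure comes out. The function name `monthly_cost_dollars` is illustrative only.

```python
def monthly_cost_dollars(watts, dollars_per_kwh, hours_per_day=24, days=30):
    """Cost of running a constant load for one month."""
    kwh = watts * hours_per_day * days / 1000.0
    return kwh * dollars_per_kwh

RATE = 0.07  # dollars per kWh; assumed, see the note above

one_machine = monthly_cost_dollars(150, RATE)
two_machines = 2 * one_machine

if __name__ == "__main__":
    print(round(one_machine, 2), round(two_machines, 2))
```

At 150 W each machine uses 108 kWh a month, so two of them together land a little above 15 dollars under these assumptions.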

If these numbers are at all accurate, my PCs are my biggest electricity
consumer at home, except for cooling in the summer. Given my AC costs
of roughly 60$ per month in the summer, my PCs are still a good fraction
of that. Moreover, as any of you who are sitting in offices warmed by
your PC can vouch, the PCs definitely contribute to heat load.

---

In office buildings, PCs are probably a much higher fraction
of total electricity consumption than in homes.

Serguei Patchkovskii

Feb 17, 2000, 3:00:00 AM

Aaron Spink (sp...@padc13.pa.dec.com) wrote:
: So sorry, everyone knows that a word is 16bits or 2 bytes. 8 bytes
: is a quadword, 16 bytes is an octaword. 16bits or 2 bytes being a
: word is pretty standard.

Well, of course, here "everybody" knows that one word is 64 bits, or 8 bytes,
and can hold one usefully-sized floating point variable. A doubleword is,
indeed, twice the size - and can hold the same amount of data as your
octaword.

I am sure there are quite a few "everybody"s who won't agree with either
definition ...

/Serge.P

--
home page: http://www.cobalt.chem.ucalgary.ca/ps/

Jeffrey S. Dutky

Feb 17, 2000, 3:00:00 AM

Serguei Patchkovskii wrote:
>
> Aaron Spink (sp...@padc13.pa.dec.com) wrote:
> : So sorry, everyone knows that a word is 16bits or 2 bytes.
> : 8 bytes is a quadword, 16 bytes is an octaword. 16bits or
> : 2 bytes being a word is pretty standard.
>
> Well, of course, here "everybody" knows that one word is 64
> bits, or 8 bytes, and can hold one usefully-sized floating
> point variable. A doubleword is, indeed, twice the size - and
> can hold the same amount of data as your octaword.
>
> I am sure there are quite a few "everybody"s who won't agree
> with either definition ...

yes, I don't agree, but I'm now completely muddled by the entire
thing. I think the problem stems from the use of the word 'word'
as the basis for definition. In the past the term 'word' has been
used by different manufacturers to refer to just about anything
they damn well pleased, whatever the relation to the hardware
in question. We could solve much of the difficulty if we just
got rid of the term and used 'byte' instead (though byte was
as contentious a term as 'word' at one time, it has since
settled down quite a bit. I think there is much more agreement
that a byte is 8-bits, on modern processors, than what the
proper width of a word is).

Therefore I propose a new set of data element names:

byte, sbyte or single-byte = 8-bits
double-byte or dbyte = 16-bits
quad-byte or qbyte = 32-bits
octa-byte or obyte = 64-bits
hexa-byte or hbyte = 128-bits

- Jeff Dutky

Leon Trotsky

Feb 17, 2000, 3:00:00 AM

Serguei Patchkovskii <patc...@acs1.acs.ucalgary.ca> wrote in message
news:88hrf2$v0a$1...@nserve1.acs.ucalgary.ca...

> Aaron Spink (sp...@padc13.pa.dec.com) wrote:
> : So sorry, everyone knows that a word is 16bits or 2 bytes. 8 bytes
>
> Well, of course, here "everybody" knows that one word is 64 bits, or 8 bytes,

Yawn, this tangent about assembly programming nomenclature is rather boring.

We'd rather discuss interesting Willamette topics such as these:

How many times faster was the demo-ed Willamette than the fastest P6/Athlon?
When can we buy one?
How far can we overclock it? 4, 5, 6 GHz?

//////\\
///_ _\\\
_| _\ /_ |_
|.|-(.)-(.)-.| Leon Trotsky
\| J |/
\ =###= /
\ .--. /
",###,/
"#"

Hank Oredson

Feb 18, 2000, 3:00:00 AM

"Serguei Patchkovskii" <patc...@acs1.acs.ucalgary.ca> wrote in message
news:88hrf2$v0a$1...@nserve1.acs.ucalgary.ca...
> Aaron Spink (sp...@padc13.pa.dec.com) wrote:
> : So sorry, everyone knows that a word is 16bits or 2 bytes. 8 bytes
> : is a quadword, 16 bytes is an octaword. 16bits or 2 bytes being a
> : word is pretty standard.
>
> Well, of course, here "everybody" knows that one word is 64 bits, or 8 bytes,
> and can hold one usefully-sized floating point variable. A doubleword is,
> indeed, twice the size - and can hold the same amount of data as your
> octaword.
>
> I am sure there are quite a few "everybody"s who won't agree with either
> definition ...
>
> /Serge.P
>
> --
> home page: http://www.cobalt.chem.ucalgary.ca/ps/

Of course that is wrong!
A word is 36 bits, containing either 4 or 6 characters.
A double word is 72 bits.

Univac 1100 series.

(smile)

--

... Hank

http://horedson.home.att.net

McCalpin

Feb 18, 2000, 3:00:00 AM

In article <38ADA890...@ix.netcom.com>,
Iain McClatchie <iai...@ix.netcom.com> wrote:
>
>I wouldn't be too surprised to see Willamette able to issue
>one MULPD and one ADDPD every 1500 MHz cycle, which would
>make for 6 DP GFLOPS, nearly as much as Itanium.

According to the published descriptions from MPF'98,
at 800 MHz, Itanium will only be capable of 3.2 DP GFLOPS
(peak hypothetical value), or 6.4 SP GFLOPS (even more
hypothetical).

It is too early to tell exactly how effectively Willamette
will implement the paired 64-bit stuff. It is easy to
imagine really fast implementations (e.g., paired fully
pipelined multipliers and adders with the capability of
issuing multiple XMM instructions per clock cycle) and it
is easy to imagine much slower implementations (e.g.,
providing only a single multiplier and a single adder and
limiting issue rate to one op per cycle, with a throughput
of one result every two cycles). My two examples differ
by a factor of 4 in peak FP throughput.
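
The factor of 4 can be checked with a little arithmetic. This is only a sketch: the function name is made up for illustration, the slow case is read as "one packed result every two cycles in total", and the Itanium line assumes two multiply-adds per cycle counted as 2 flops each.

```python
from fractions import Fraction

def peak_mflops(clock_mhz, ops_per_cycle, flops_per_op=2):
    """Peak MFLOPS = clock (MHz) * ops completed per cycle * flops per op."""
    return Fraction(clock_mhz) * Fraction(ops_per_cycle) * flops_per_op

# Fast case: one packed multiply and one packed add complete every cycle,
# each working on a pair of doubles.
fast = peak_mflops(1500, 2)

# Slow case (assumed reading): one packed result every two cycles overall.
slow = peak_mflops(1500, Fraction(1, 2))

# Itanium at 800 MHz, assuming two multiply-adds per cycle, 2 flops each.
itanium = peak_mflops(800, 2)

ratio = fast / slow
```

Under these assumptions the fast case works out to 6 DP GFLOPS and the slow case to 1.5, against 3.2 for the 800 MHz Itanium.
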
--
John D. McCalpin, Ph.D. mcca...@austin.ibm.com
Senior Scientist IBM Power Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."

Iain McClatchie

unread,

Feb 18, 2000, 3:00:00 AM2/18/00

to

Iain> I wouldn't be too surprised to see Willamette able to issue
Iain> one MULPD and one ADDPD every 1500 MHz cycle, which would
Iain> make for 6 DP GFLOPS, nearly as much as Itanium.

John> According to the published descriptions from MPF'98,
John> at 800 MHz, Itanium will only be capable of 3.2 DP GFLOPS
John> (peak hypothetical value), or 6.4 SP GFLOPS (even more
John> hypothetical).

Doh! You're absolutely right. I slipped a bit. Itanium, as
I said, will issue two double-precision multiply-adds per
cycle, or 3.2 DP GFLOPS, as you said.

I can't imagine they'd allow a Willamette implementation out
the door that did faster FP than Itanium... or more precisely,
that they'd be willing to build the cache hierarchy into
Willamette to support more FP than Itanium. But I stand by
my assertion that FP units are getting super cheap these days.
Intel has figured out how to use domino gates as latches,
which nukes much of the overhead associated with pipelining
and makes 1500 MHz FP units fairly reasonable.

The thing I would like to know, more than anything else about
Willamette's (or the P6's) pipe, is how their out-of-order
issue logic handles a speculative first-level cache hit. What
I mean is this: You issue a load, and two or three cycles
later you want to issue an operation dependent on that load.
The data will be ready when the op gets to the ALU, but at the
time of issue there is NO WAY that you know you have a cache
hit, at least, not at 1500 MHz. So you issue the dependent op
speculatively.

So they have to hold onto the dependent op until they're sure
the load actually hit in the cache. I figure they probably
hold onto the op until they're sure the load either hit in the
primary, or missed but hit in the secondary. The latter is
probably ten or more cycles. Where do they hold onto that op?

If they leave it in the issue window, it uses up the hardware
that compares issue tags to see if the instruction is ready to
go. Broadcasting issue tags is a high-fanout operation that's
probably on the critical path. Broadcasting to ten more issue
slots (or twenty or thirty if we're issuing 2 or 3 ops/cycle)
has got to be horribly slow.

So maybe they've got some secondary issue window for ops whose
data is ready but might be bogus from a primary cache miss.
Wow. Just think of the state transitions an instruction can
go through. Does anyone who can say know of a simpler way?

Hmm. For those who don't like thinking about particular issue
windows, etc, you can think of this problem another way: a
simple in-order machine runs a particular schedule of
instructions through its pipes. If it takes a cache miss, it
can just stop and replay the schedule starting with the first
dependent instruction, once the data is ready. An out-of-order
machine wants to run that same particular schedule of
instructions through its pipes, until it sees a cache miss.
Then it wants to unschedule all the yet-to-be-done dependent
operations, reschedule all the dependent operations, and
ideally make forward progress through the operations independent
of the load while all this other activity is going on. The
tricky part is that you don't know a load missed in the primary
cache until maybe 5 cycles after it was originally scheduled.

-Iain McClatchie

Christopher Gomez

unread,

Feb 18, 2000, 3:00:00 AM2/18/00

to

Iain McClatchie wrote:
>
> The thing I would like to know, more than anything else about
> Willamette's (or the P6's) pipe, is how their out-of-order
> issue logic handles a speculative first-level cache hit. What
> I mean is this: You issue a load, and two or three cycles
> later you want to issue an operation dependent on that load.
> The data will be ready when the op gets to the ALU, but at the
> time of issue there is NO WAY that you know you have a cache
> hit, at least, not at 1500 MHz. So you issue the dependent op
> speculatively.

First, the P6 doesn't issue dependents of loads speculatively.
When the load completes, the dependent instructions are marked
data ready and are then able to be dispatched - although they still
might spend some time in the window waiting to get picked (dispatched).

Second, the details of Willamette's OOO core have yet to be released,
but until now there has been no claim that it issues dependents
of loads speculatively (although it could).

Third, as far as I remember the Alpha 21264 (or maybe it's the 364) does issue
dependents of loads speculatively. Basically it uses a static Hit/Miss
prediction that the producer load will hit in the L1 D$ (read: level 1
data cache) - more specifically it statically assumes Hit in L1 D$ and
issues dependents after the minimum delay, which is the latency for the L1
D$ to produce the data (which I think is 2~3 cycles).

If the producer load misses in the cache, which is known after the tag
lookup completes (this happens in parallel with the data read to that
set in the cache), then the dependents of the load which have been issued
have to be re-issued (replayed) so that the program maintains correctness.
The reason is that the producer load missed in the cache
but forwarded bad data to its dependents (consumers). It got bad
data because the cache read happens in parallel with the tag compare.

This usually involves some sort of signal which comes back from
the cache which indicates that a particular load missed in the cache
and that its dependents need to be replayed if they have been issued
(if their state in the window is 'executing' then the state is reset
back to 'ready'). All of the information needed is available in the
window (i.e. the dependency chain and the id of the producer load
sent back by the cache).
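
A toy model of that replay signal, for anyone who wants to play with the state transitions. It is a sketch only: the class and method names are hypothetical, there is no timing, and it does not describe any shipping core. It just shows the window walking the dependency chain of a missed load and resetting 'executing' entries to 'ready'.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    op_id: int
    sources: set = field(default_factory=set)  # ids of producer ops
    state: str = "waiting"                      # waiting -> ready -> executing

class IssueWindow:
    def __init__(self):
        self.entries = {}  # insertion order doubles as program order

    def add(self, op_id, sources=()):
        srcs = set(sources)
        self.entries[op_id] = Entry(op_id, srcs, "waiting" if srcs else "ready")

    def dispatch(self, op_id):
        entry = self.entries[op_id]
        if entry.state != "ready":
            raise ValueError(f"op {op_id} is not ready")
        entry.state = "executing"
        # Speculatively wake consumers, assuming every producer will hit.
        for other in self.entries.values():
            if other.state == "waiting" and all(
                self.entries[s].state == "executing" for s in other.sources
            ):
                other.state = "ready"

    def replay(self, missed_load_id):
        """Cache reports a miss: reset issued dependents back to 'ready'."""
        bad = {missed_load_id}
        replayed = []
        for entry in self.entries.values():
            if entry.sources & bad:          # directly or transitively dependent
                bad.add(entry.op_id)
                if entry.state == "executing":
                    entry.state = "ready"
                    replayed.append(entry.op_id)
        return replayed

w = IssueWindow()
w.add(1)                # the producer load
w.add(2, sources=[1])   # consumer of the load
w.add(3, sources=[2])   # consumer of op 2
w.add(4)                # independent op
for op in (1, 2, 3, 4):
    w.dispatch(op)
replayed = w.replay(1)  # the cache says load 1 missed
```

Ops 2 and 3 go back to 'ready' while the independent op 4 is left alone, which is the forward progress an out-of-order machine wants to keep.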

It is important that your cache hit rates be high because if you are issuing
consumers of loads speculatively and you have to replay them - then you waste
much of your execution resources and dispatch slots.

Another issue is speculative address loads and stores (these
are dependents of a producer load which misses in the cache
but are speculatively issued). Speculative Address load (spec_addr_ld)
instructions can go out to the cache with an address formed from
the bad data forwarded by the producer load (which missed in the cache).
If spec_addr_ld hits in the cache - so what - they get replayed anyways -
the only damage is wasted execution resources (which would probably go unused)
- but if they miss in the cache then they cause a cache line to get
evicted and a new line to be brought in - this pollutes the cache
and has negative effects on performance!

> So they have to hold onto the dependent op until they're sure
> the load actually hit in the cache. I figure they probably
> hold onto the op until they're sure the load either hit in the
> primary, or missed but hit in the secondary. The latter is
> probably ten or more cycles. Where do they hold onto that op?

Most likely in the issue window!

> If they leave it in the issue window, it uses up the hardware
> that compares issue tags to see if the instruction is ready to
> go. Broadcasting issue tags is a high-fanout operation that's
> probably on the critical path. Broadcasting to ten more issue
> slots (or twenty or thirty if we're issuing 2 or 3 ops/cycle)
> has got to be horribly slow.

It is possible to do the replay compare in parallel with the dispatch
hardware.

> So maybe they've got some secondary issue window for ops whose
> data is ready but might be bogus from a primary cache miss.
> Wow. Just think of the state transitions an instruction can
> go through. Does anyone who can say know of a simpler way?

As I said above you can use the same issue window, you just let the
cache system send back a message with the id of the load who missed
(he is still hanging around until he gets retired) and the dependency
info is already in the window (this is mentioned in the Microprocessor
Report on Hal's Sparc64 V processor given below) - then all of the
dependents get marked as ready to issue again. They might even issue again
if the execution unit is free - even though they may be already in
flight - this is OK because presumably your execution units are
pipelined.

Some references for your reading pleasure:

Speculation Techniques for Improving Load Related Instruction Scheduling
ISCA '99 by Adi Yoaz, Matan Erez, Ronny Ronen, and Stephan Jourdan

also see

Hal Makes Sparcs Fly - Sparc64V Employs Trace Cache and Superspeculation
for High ILP by Keith Diefendorff - Microprocessor Report, Volume 13,
Number 15, November 15, 1999 (Presented at Microprocessor Forum 1999)

The First Paper talks about how to implement a dynamic Hit/Miss predictor
for the producer loads similar to the branch prediction algorithms (hit/miss
history, 2 level adaptive, bi-mode, etc...).

The second paper talks about the new Sparc64 V from Hal which does implement
speculative issue of load dependent instructions (execution past memory deps)
- they call it "Superspeculation", a term coined by Mikko Lipasti and John Shen
of CMU. They implement exactly what I said - issue the deps and, on a cache miss,
re-mark them ready to be issued!

Also the paper talks about the trace cache implementation in the processor, which is
quite interesting and is probably more relevant to Willamette, which is also purported
to be a trace processor. Not much detail has been released about its fetch
architecture, except that it only decodes one x86 instruction per cycle and then
builds traces (presumably at basic block boundaries) of predecoded and mapped x86
instructions (microops) and stores them in a trace cache to be reissued into the
window when a trace cache hit occurs. Some sensitivity to branch prediction accuracy
is to be expected ;-)
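
The decode-once, reissue-on-hit idea can be sketched in a few lines. Everything here is hypothetical (the class, the toy instruction strings and their micro-op breakdown); it only illustrates caching decoded micro-ops by trace start address, not how Willamette actually builds or indexes its traces.

```python
# Toy x86 -> micro-op table, invented purely for illustration.
TOY_UOPS = {
    "add [mem], eax": ["load", "add", "store"],
    "dec ecx": ["dec"],
    "jnz loop": ["branch"],
}

def toy_decode(insn):
    return TOY_UOPS[insn]

class TraceCache:
    def __init__(self, decode):
        self.decode = decode
        self.traces = {}               # trace start address -> tuple of micro-ops
        self.decoded_instructions = 0  # how many times the slow decoder ran

    def fetch(self, start_addr, block):
        """Return the micro-ops for the block that starts at start_addr."""
        trace = self.traces.get(start_addr)
        if trace is None:              # miss: decode one instruction at a time
            uops = []
            for insn in block:
                uops.extend(self.decode(insn))
                self.decoded_instructions += 1
            trace = tuple(uops)
            self.traces[start_addr] = trace
        return trace                   # hit: no decoding at all

tc = TraceCache(toy_decode)
loop_body = ["add [mem], eax", "dec ecx", "jnz loop"]
first = tc.fetch(0x1000, loop_body)    # builds the trace
second = tc.fetch(0x1000, loop_body)   # served straight from the trace cache
```

The second fetch never touches the decoder, which is the whole attraction when the decoder handles only one x86 instruction per cycle.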

> --------------------------------------------------------------------
> -- All opinions are my own and do not reflect those of my employer.
> --
> -- Chris Gomez
> -- Performance Architect UltraSparc 5
> -- Sun Microelectronics
> -- cag...@eng.sun.com
> --------------------------------------------------------------------

Dennis O'Connor

unread,

Feb 18, 2000, 3:00:00 AM2/18/00

to

"Iain McClatchie" <iai...@ix.netcom.com> wrote ...
> Has anyone seen first shipment dates for Willamette?

Yes.
But I can't tell.
I'm a terrible tease. :-)
--
Dennis O'Connor dm...@primenet.com
Vanity Web Page http://www.primenet.com/~dmoc/

Paul Hsieh

unread,

Feb 18, 2000, 3:00:00 AM2/18/00

to

Terje Mathisen wrote:
> Thomas Womack wrote:
> >
> > On ftp://download.intel.com/design/processor/WmtSDG.pdf , you will find a
> > software developer's guide for Willamette.
> >
> > The 1.5GHz clock speed is presumably mostly due to the 20-stage pipeline,
>
> The P6 pipeline is of comparable length.

I'm not sure I can agree with you. The P6 is 13 stages according to
their own documentation. They are adding a whole K6 in terms of the
number of stages. :o) The top two stages (EIP, decode) are totally
asynchronous, however, and should not generally appear on the bottom
line.

> > and is I suppose the logical way to go if processors sell themselves on
> > their MHz figure; the double-pumped integer ALU seems an interesting
> > approach, though I suppose it's comparable to the 3GHz figures IBM were
>
> I wonder why they don't claim 3 GHz for Willamette, I bet we'll see this
> number somewhere by the time it is released.

They probably didn't want to be caught "exaggerating". 1.5GHz is
impressive enough. Claims of 3GHz would draw credibility attacks from
everywhere.

> > throwing about last week. Keeping decoded micro-ops in the instruction
> > cache (called Execution Trace Cache) is a clever idea. With a pipeline
> > that long,
>
> The Pentium chip did the same, only pre-decoded instructions could be
> issued in parallel, so the first invocation of any code would always run
> like a 486.

Hmmm ... good point. The structure for this cache is likely to be
radically different I should think.

> > the branch prediction had better be good - and the first priority in the
> > code optimisation guidelines is 'improve branch predictability'. At least
> > indirect branches are now predicted, and branch hints have been added.
>
> I wonder how large the new branch prediction unit is, since Intel state
> that it is
>
> "effectively combining all currently available prediction schemes"

They claimed that they were using a 4K entry branch target buffer.

[...]

> BTW, what's wrong with 'octaword'?

Somebody else probably used it first.

[...]

> > On the whole, this is an interesting-looking chip; it'll be a year before
> > I'd be able to afford a Willamette system, but the release of the
> > documentation now has had the desired effect of making me no longer lust
> > after Athlons as the ultimate in performance.

Hmmm ... you seem easily swayed. There are no quotations of actual
performance on benchmarks anywhere in this documentation, or anything
else Intel has stated thus far.

> > [...] I'll buy the Athlon anyway, knowing that there'll be a factor-of-two
> > upgrade available in mid-2001 (with 'upgrade' as in 'replace motherboard,
> > processor, power supply, cooling, memory and case').

:o) Somehow I don't think you are the first person to buy a computer,
with the full knowledge that something twice as fast would be available
in 18 months.

--
Paul Hsieh
http://www.pobox.com/~qed/cpujihad.shtml

Leon Trotsky

unread,

Feb 18, 2000, 3:00:00 AM2/18/00

to

Christopher Gomez <christop...@Eng.sun.com> wrote in message
news:38AE28A9...@Eng.sun.com...

> Iain McClatchie wrote:
> >
> > If they leave it in the issue window, it uses up the hardware
> > that compares issue tags to see if the instruction is ready to
> > go. Broadcasting issue tags is a high-fanout operation that's
> > probably on the critical path. Broadcasting to ten more issue
> > slots (or twenty or thirty if we're issuing 2 or 3 ops/cycle)
> > has got to be horribly slow.
>
> It is possible to do the replay compare in parallel with the dispatch
> hardware.

Why broadcast to several slots in the issue window?
Why not use indirection?

The tags in the issue window slots are essentially operand pointers.
When the cache data is available, it's moved into the load/store buffer,
the tag storage (ie what the instruction window tags can point to)
receives an index within the ld/st buffer, then the tag is marked valid,
thereby releasing the waiting load instructions.

chrisv

unread,

Feb 18, 2000, 3:00:00 AM2/18/00

to

On 18 Feb 2000 07:14:45 GMT, cdu@jawa. (chris ulrich) wrote:

> I'll paraphrase what I said in the hopes that you might understand me
>better the second time around.

Blah blah blah. When Andy Glew wrote:

>a) to reduce overall power consumption even in desktop mode
> - PCs consume a significant fraction of all US electricity

You responded with "this is silly." I'm proud of you that you use
fluorescent lights, but this does not make the goal of reducing power
consumption in PC's "silly." You obviously feel VERY important.

Leon Trotsky

unread,

Feb 19, 2000, 3:00:00 AM2/19/00

to

Iain McClatchie <iai...@ix.netcom.com> wrote in message
news:38ADA890...@ix.netcom.com...
> * If you need 64-bit addressing but also IA-32 compatibility at
> speed, you're either going to:
> * cross your fingers and hope that AMD saves your bacon, or
> * start porting and recompiling!

Extending the 8086 to 64 bits is nonsense, if not disgusting.

> I think the interesting architectural thing to talk about in
> Willamette is the trace cache. I don't believe this is the same
> thing as was in the P5. I think the P5 used the first pass of
> decoding to compute and then cache where the instruction boundaries
> were.

Yes, the P5 had a decode buffer that was 32 bytes or so,
which was nothing more than a simple buffer.

I'm guessing that Willamette's "trace cache" is little more than
an instruction cache that caches micro-ops instead of X86 opcodes.
This indicates that Willamette needs more stages to decode X86 opcodes
than older X86s, and the "trace cache" mitigates this disadvantage of a long pipe.

With smaller on-chip cache memory, and a single decode stage,
the designers of older X86s chose to cache the more compressed X86 opcodes.

Terje Mathisen

unread,

Feb 19, 2000, 3:00:00 AM2/19/00

to

Paul Hsieh wrote:
>
> Terje Mathisen wrote:
> > Thomas Womack wrote:
> > >
> > > On ftp://download.intel.com/design/processor/WmtSDG.pdf , you will find a
> > > software developer's guide for Willamette.
> > >
> > > The 1.5GHz clock speed is presumably mostly due to the 20-stage pipeline,
> >
> > The P6 pipeline is of comparable length.
>
> I'm not sure I can agree with you. The P6 is 13 stages according to
> their own documentation. They are adding a whole K6 in terms of the
> number of stages. :o) The top two stages (EIP, decode) are totally
> asynchronous, however, and should not generally appear on the bottom
> line.

However, assuming these 20 stages are the double-pumped parts of the
chip, the path from entry to exit is actually shorter than on the P6,
right?

> > The Pentium chip did the same, only pre-decoded instructions could be
> > issued in parallel, so the first invocation of any code would always run
> > like a 486.
>
> Hmmm ... good point. The structure for this cache is likely to be
> radically different I should think.

Storing (relatively wide) P6 micro-ops would mean that the code cache
must be a lot wider than on a P6; it does look like the P6 microops were
designed like most microcode, i.e. very wide/flat encoding, disregarding
storage efficiency.

I guess they _could_ use some sort of intermediate solution, gaining
some measure of fixed compression relative to simply caching all P6-type
micro-ops.

Paul Hsieh

unread,

Feb 19, 2000, 3:00:00 AM2/19/00

to

Terje.M...@hda.hydro.com says...

> Paul Hsieh wrote:
> > Terje Mathisen wrote:
> > > Thomas Womack wrote:
> > > > On ftp://download.intel.com/design/processor/WmtSDG.pdf , you will find a
> > > > software developer's guide for Willamette.
> > > >
> > > > The 1.5GHz clock speed is presumably mostly due to the 20-stage pipeline,
> > >
> > > The P6 pipeline is of comparable length.
> >
> > I'm not sure I can agree with you. The P6 is 13 stages according to
> > their own documentation. They are adding a whole K6 in terms of the
> > number of stages. :o) The top two stages (EIP, decode) are totally
> > asynchronous, however, and should not generally appear on the bottom
> > line.
>
> However, assuming these 20 stages are the double-pumped parts of the
> chip, the path from entry to exit is actually shorter than on the P6,
> right?

Only the integer ALU stages are double pumped. That's the whole point of
the architecture. If the whole thing were double pumped, they could just go
ahead and call it 3GHz. I counted the stages; there are definitely 20
distinct non-double-pumped stages -- it's a pretty insane architecture.

> > > The Pentium chip did the same, only pre-decoded instructions could be
> > > issued in parallel, so the first invocation of any code would always run
> > > like a 486.
> >
> > Hmmm ... good point. The structure for this cache is likely to be
> > radically different I should think.
>
> Storing (relatively wide) P6 micro-ops would mean that the code cache
> must be a lot wider than on a P6; it does look like the P6 microops were
> designed like most microcode, i.e. very wide/flat encoding, disregarding
> storage efficiency.

Yes, but they showed an associated microcode ROM plus widget that is
associated to the trace cache. I suspect that what happens is that they
have selected a size that works for "most" x86=>micro-op translations for
the trace cache, and for sizes that are too large or otherwise degraded
they simply set a flag and a pointer into the microcode ROM. The software
writer's guide seemed to be telling programmers to use x86 opcodes that
have limited immediate operands, which indicates that the trace cache may
have (aggregate) micro-op size limitations, which supports my theory.

> I guess they _could_ use some sort of intermediate solution, gaining
> some measure of fixed compression relative to simply caching all P6-type
> micro-ops.

Indeed, the intermediate solution could be less descriptive, while being
orthogonal and sufficient to encode the whole micro-op which is actually
expanded in a straightforward way in some stage somewhere. I.e., they
could use some sort of RISC-like opcodes with flags for x86 instruction
boundaries (so it can determine the instruction abort boundaries.)

--
Paul Hsieh
http://www.pobox.com/~qed/

Bernd Paysan

unread,

Feb 19, 2000, 3:00:00 AM2/19/00

to

Iain McClatchie wrote:
> Intel's attempt to establish a new CPU monopoly will have failed,
> but it won't matter, because 400 million of the 500 million cell
> phones sold in 2002 will have StrongArms in them.
>
> And at some point, these wonderful cheap high-speed general purpose
> computers running free Unix-compatible operating systems will
> disappear, making the lives of people like me much harder and more
> expensive, and the lives of people like my mom much easier and more
> expensive.

I doubt it. The 400-500 million StrongARM-based cell phones will quite
likely run Linux (Samsung recently announced a Linux-based
StrongARM-PDA, that's just the start). Your mom won't notice any
difference, but the first thing you will do when you get the device is
to invoke a root bash, start the equivalent of dselect, and fetch a
whole bunch of packages over the net (like gcc, emacs, gdb, quake V ...
no, that one is included in the cell phone distro).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

H.W. Stockman

unread,

Feb 20, 2000, 3:00:00 AM2/20/00

to

"Leon Trotsky" <tro...@ussr.ru> wrote in message
news:38ae532b$0$2...@nntp1.ba.best.com...

> Iain McClatchie <iai...@ix.netcom.com> wrote in message
> news:38ADA890...@ix.netcom.com...
> > * If you need 64-bit addressing but also IA-32 compatibility at
> > speed, you're either going to:
> > * cross your fingers and hope that AMD saves your bacon, or
> > * start porting and recompiling!
>
> Extending the 8086 to 64 bits is nonsense, if not disgusting.

I think the main problem would be in the efficient choice of "modes" and
opcodes. When the 386 came along, the choice was a mode in which 8-bit and
32-bit instructions were easy (and a 66H or 67H pre-instruction byte was
required to give 16-bit ops), or a mode in which 8-bit and 16-bit ops were
easy (and a 66H or 67H pre-byte was required for 32-bit instructions). I'm
not sure there is any easy way to extend that scheme to include 64-bit ops,
without sacrificing compatibility or performance -- my guess is that the
pre-bytes are disfavored in heavily-pipelined chips.
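
To make the scheme concrete, here is a small sketch of how the two prefix bytes select sizes in 16-bit and 32-bit code. The function name is made up; the prefix values are the standard operand-size (66H) and address-size (67H) overrides. Each prefix is a single toggle between two sizes, so there is no spare state left over for a third, 64-bit size.

```python
OPERAND_SIZE_PREFIX = 0x66
ADDRESS_SIZE_PREFIX = 0x67

def effective_sizes(default_bits, prefixes):
    """Return (operand_bits, address_bits) for a 16- or 32-bit code segment."""
    if default_bits not in (16, 32):
        raise ValueError("default size must be 16 or 32")
    toggled = 16 if default_bits == 32 else 32
    operand = toggled if OPERAND_SIZE_PREFIX in prefixes else default_bits
    address = toggled if ADDRESS_SIZE_PREFIX in prefixes else default_bits
    return operand, address

plain_32 = effective_sizes(32, [])              # native 32-bit op
word_in_32 = effective_sizes(32, [0x66])        # 16-bit op in 32-bit code
dword_in_16 = effective_sizes(16, [0x66, 0x67]) # 32-bit op and address in 16-bit code
```

8-bit operations are selected by the opcode itself, which is why they are "easy" in both modes.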

chris ulrich

unread,

Feb 20, 2000, 3:00:00 AM2/20/00

to

In article <OuKr4.2630$nF4....@newsfeed.slurp.net>,
Gerard S. <ger...@prairietech.net> wrote:
%%I did some swags on the above assumptions, but plugged in a few
%%realistic numbers (hey! it's my swag...) and came up with ...
%%
%%For an (average?) PC consuming 200 watts/hour:

I would be surprised if the average PC consumed more than 40-50
watts, with some using 20-30 and some using 60-80. Just because the
power supply is rated to output 200 or 250 doesn't mean that the
average system is running at or even within 50% of its rated limit
for any noteworthy bit of time. 200 watts is a *huge* amount of power
to dissipate from a metal box with just a couple of DC fans, after all.
(care to try to cool 2 60 watt light bulbs in a PC case with standard
pc fans?). I use computers every day that draw a real (measured)
100-300 watts of power, and they're *big*, and have obvious thought
put into their cooling systems, and have lots of fans, and so on.
chris

[calculations removed]
%% 37,565,000,000 ====totals==== 50 million PS's
%%If the total US consumption for 1996 was 3.66e12 kilowatt-hours, then
%%the above 3.8e10 kilowatt-hours used by PC's was about 1.0 percent.
%%So it looks like your "more likely" guess of somewhere south of 1% is
%%right on the nose. Not bad swagging, Del.
%%Gerard S.
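
The quoted totals are easy to re-check, and to rescale for a lower average draw. This is a sketch using only the figures quoted above; the variable names are invented, and the 50 W line assumes that every unit in the removed calculation was charged a flat 200 W, which the surviving text does not confirm.

```python
pc_kwh = 37_565_000_000   # quoted total, kWh per year
us_kwh = 3.66e12          # quoted 1996 US consumption, kWh
units = 50_000_000        # quoted number of units

share_percent = 100 * pc_kwh / us_kwh      # about 1.03 percent
kwh_per_unit = pc_kwh / units              # about 751 kWh per unit per year
hours_at_200w = kwh_per_unit / 0.2         # implied running hours at 200 W

# Same running hours, but at an average draw of 50 W instead of 200 W
# (assumes the original calculation used a flat 200 W per unit).
share_percent_at_50w = share_percent * 50 / 200
```

If the average box really draws closer to 50 W, the share drops to roughly a quarter of a percent.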

Leon Trotsky

unread,

Feb 20, 2000, 3:00:00 AM2/20/00

to

H.W. Stockman <hws...@wizard.com> wrote in message
news:8NXr4.56204$ox5.15...@tw11.nn.bcandid.com...

Good point.

Note the X86 one-byte opcode space was filled long ago.
Imagine, storing a 64 bit pointer into 64 bit addressable memory,
the counterpart of MOV [EDI],EAX, will take 19 bytes on Jackhammer!

2 (two byte opcode) + 1 (modrm) + 8 (disp) + 8 (imm) = 19 bytes
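
The byte count is just a sum of field sizes, so it is easy to see how much of the 19 comes from assuming 8-byte fields. The helper below is hypothetical, and the 32-bit case is only a what-if for comparison, not a statement about any announced encoding.

```python
def encoded_length(opcode=1, modrm=0, sib=0, disp=0, imm=0, prefixes=0):
    """Total instruction length in bytes from the size of each field."""
    return prefixes + opcode + modrm + sib + disp + imm

# The worst case from the post: every field present, disp and imm at 8 bytes.
worst_case_64 = encoded_length(opcode=2, modrm=1, disp=8, imm=8)

# The same fields if displacement and immediate stayed at 4 bytes (what-if).
same_fields_32 = encoded_length(opcode=2, modrm=1, disp=4, imm=4)

# For reference: today's MOV [EDI],EAX is one opcode byte plus modrm (89 07).
mov_edi_eax = encoded_length(opcode=1, modrm=1)
```

Note that the existing MOV [EDI],EAX carries neither a displacement nor an immediate, so the 19-byte figure describes a worst-case form rather than a direct counterpart.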

Another deficiency is the modrm field precludes using more than 8 GPRs !

AMD unwisely overlooked an existing IA32 feature that already supports
64 bit addressing, the Physical Address Extension (PAE).

Yuck, this 64-bit X86 might be the ugliest processor design ever.

Jason Stratos Papadopoulos

unread,

Feb 21, 2000, 3:00:00 AM2/21/00

to

Andy Glew (gl...@cs.wisc.edu) wrote:

: If a load comes back an L1 cache miss, you have to cancel the load,
: and any dependent operations that got scheduled assuming it was
: a cache hit.

: I believe the Alpha 21264 does something similar, which is why
: they try to predict L1 cache hits, to reduce the number of dependent
: ops that might have to be cancelled.

As I remember, according to the 21264 compiler writer's guide the 21264
schedules instructions to issue based on an L2 load hit by default, and
then switches to scheduling for an L1 hit if a certain fraction of the
last couple of loads hit in L1 cache.

I may have that backwards though...
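
A predictor driven by recent hit/miss history can be sketched as a saturating counter. The counter width, step sizes and starting value below are illustrative assumptions, not figures taken from the 21264 documentation, and the class name is made up.

```python
class HitMissPredictor:
    """Saturating counter: high values mean recent loads mostly hit in L1."""

    def __init__(self, bits=4, hit_step=1, miss_step=2):
        self.max_value = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.hit_step = hit_step
        self.miss_step = miss_step
        self.counter = self.max_value   # start out scheduling for L1 hits (assumed)

    def predict_hit(self):
        return self.counter >= self.threshold

    def update(self, hit):
        if hit:
            self.counter = min(self.max_value, self.counter + self.hit_step)
        else:
            self.counter = max(0, self.counter - self.miss_step)

p = HitMissPredictor()
initial = p.predict_hit()
for _ in range(4):          # a burst of four L1 misses
    p.update(False)
after_misses = p.predict_hit()
p.update(True)              # one hit brings the counter back over the threshold
after_one_hit = p.predict_hit()
```

Penalising a miss more heavily than a hit is rewarded makes the scheduler back off quickly when replays start to pile up.
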
jasonp

Andrew M. Dyer

unread,

Feb 21, 2000, 3:00:00 AM2/21/00

to

Personally, I would vote for shortening it to just oc-word because that's what
it is :-)

--
Andrew Dyer <adyer@enteractDOTcom>

Where do you want to go today?
Nevermind, you're coming with us.

Leon Trotsky

unread,

Feb 21, 2000, 3:00:00 AM2/21/00

to

Andreas Kaiser <a...@s.netic.de> wrote in message news:jNqxOIiB+nS+Yd...@4ax.com...

> On Sun, 20 Feb 2000 13:06:06 -0800, "Leon Trotsky" <tro...@ussr.ru>
> wrote:
>
> >Note the X86 one-byte opcode space was filled long ago.
> >Imagine, storing a 64 bit pointer into 64 bit addressable memory,
> >the counterpart of MOV [EDI],EAX, will take 19 bytes on Jackhammer!
> >
> >2 (two byte opcode) + 1 (modrm) + 8 (disp) + 8 (imm) = 19 bytes
>
> 64-bit addressing does not necessarily imply that immediate values and
> displacements are encoded in 64 bits.

My point was to illustrate how ugly the X86 would be if extended straightforwardly
to 64 bits, regardless of other possible 64 bit addressing schemes.

No wonder Intel developed IA64.

> On the MPF99, AMD told that with
> the exception of "mov reg64,imm64", Sledgehammer encodes displacements
> and immediates in 32 bits, sign extended.

I would imagine that Sledgehammer is actually a divergent extension.

This I'm certain of: if all X86s were the Gorgon Sisters of microprocessors,
Sledgehammer would be Medusa ;-!

> >AMD unwisely overlooked an existing IA32 feature that already supports
> >64 bit addressing, the Physical Address Extension (PAE).
>
> A large database machine may want to able to use more than 4GB of
> shared buffer cache by each database process, which implies a
> virtual/linear address space in excess of 4GB. PAE however only allows
> more than 4GB physical memory, each process' linear memory is still
> limited to 4GB.

AFAIK the overwhelming reason why >4GB of real memory is needed on X86
is to contain a gigantic disk cache of an X86 file server.

The PAE feature is designed to efficiently switch between several
4GB virtual spaces. By reloading the CR3 register, the entire virtual
space can be changed. Or, by modifying a PTE, a page of virtual space
can be changed. Either way, any byte of 2^64 byte disk cache can be accessed
(PAE architecturally is 64 bits) using the existing 32 bit addressing instructions.
The inherent way a cache works would make switching virtual spaces a rare event.
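
A toy model of the windowing idea, for illustration only. The class name is hypothetical, a bytearray stands in for the large physical memory, and a single remappable page stands in for the PTE update plus INVLPG; the real cost of those operations and of the resulting TLB misses is not modelled at all.

```python
PAGE = 4096

class WindowedCache:
    """One virtual page slot that is remapped onto a larger physical store."""

    def __init__(self, physical):
        self.physical = physical   # stands in for memory beyond the 4GB space
        self.mapped_frame = None   # physical frame the window points at
        self.remaps = 0            # how often the "PTE" had to be rewritten

    def read(self, phys_addr):
        frame, offset = divmod(phys_addr, PAGE)
        if frame != self.mapped_frame:   # would be: modify the PTE, then INVLPG
            self.mapped_frame = frame
            self.remaps += 1
        return self.physical[self.mapped_frame * PAGE + offset]

memory = bytearray(PAGE * 4)
memory[PAGE * 3 + 5] = 42
cache = WindowedCache(memory)
a = cache.read(PAGE * 3 + 5)   # first touch of frame 3: one remap
b = cache.read(PAGE * 3 + 6)   # same frame: no remap
c = cache.read(0)              # different frame: second remap
```

How rare the remaps are depends entirely on the locality of the accesses, which is the point in dispute.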

Therefore, I see no worthwhile reason for AMD to kludge the X86 to 64 bits,
other than for marketing purposes.

> The K7 does support 1 of those 2 physical address extensions Intel
> defined. The 8-byte descriptor method is supported, just the 4MB-page
> based extension is not. And IIRC support for the latter has been added
> in the meantime.

PSE (4M paging) wasn't an extension of physical addresses.

Leon Trotsky

unread,

Feb 21, 2000, 3:00:00 AM2/21/00

to

Aaron Spink <sp...@padc13.pa.dec.com> wrote in message
news:sfqog99...@padc13.pa.dec.com...

> "Leon Trotsky" <tro...@ussr.ru> writes:
>
>
> > AFAIK the overwhelming reason why >4GB of real memory is needed on X86
> > is to contain a gigantic disk cache of a X86 file server.
> >
>
> Um, no. The overwhelming reason is to be able to support DB caches
> that are greater than 4GB or 3GB in most x86 operating systems.

"most X86 operating systems"

(ROL) as if Winblows 1898 could handle that amount of memory.

> DB caches need to be shared space between a large number of DB threads or
> processes. Also the ability to support in memory DB's for a variety
> of workloads is also becoming important.

As I explained, IA32's PAE can accommodate such tasks adequately,
but perhaps not elegantly enough to satisfy an academic.

> We are nearing the time when a mid-range server will have 32-64GB of
> ram and high end servers will support upwards of TB's of physical
> memory. You don't want to play TLB games to access that level of
> memory.
>
> Aaron Spink
> not speaking for Compaq

You aren't speaking for Compaq, but you are >>SPAMMING<< for DEC.

You don't know much about the X86 architecture, if anything.
The time required to execute MOV CR3 or MOV/MOV/INVLPG
(to change virtual space) is negligible on today's 700 MHz P6 or Athlon.

Besides, why bother spamming? Alpha will be dead and gone, years from now,
when a fileserver or database process really must access all 2^64 bytes.
Analysts expect by 2002 or 2003, Compaq will be selling Itanium servers instead.

George Herbert

unread,

Feb 21, 2000, 3:00:00 AM2/21/00

to

Leon Trotsky <tro...@ussr.ru> wrote:
>Aaron Spink <sp...@padc13.pa.dec.com> wrote in message
>>Leon Trotsky <tro...@ussr.ru> wrote:
>> > AFAIK the overwhelming reason why >4GB of real memory is needed on X86
>> > is to contain a gigantic disk cache of an X86 file server.
>>
>> Um, no. The overwhelming reason is to be able to support DB caches
>> that are greater than 4GB or 3GB in most x86 operating systems.
>[...]
>
>> We are nearing the time when a mid-range server will have 32-64GB of
>> ram and high end servers will support upwards of TB's of physical
>> memory. You don't want to play TLB games to access that level of
>> memory.
>>
>> Aaron Spink
>> not speaking for Compaq
>
>
>You aren't speaking for Compaq, but you are >>SPAMMING<< for DEC.

Let me cut and paste the last bit of your posting:

>Besides, why bother spamming? Alpha will be dead and gone, years from now,
>when a fileserver or database process really must access all 2^64 bytes.
>Analysts expect by 2002 or 2003, Compaq will be selling Itanium servers instead.

This is spamming for Intel worse than anything Aaron said, by far.
Aaron was not advocating an architecture. You explicitly did.

If you want to be anonymous, fine. If you want to be partisan,
that's ok too, there are partisans for every chip line around
who come here from time to time. But don't go around doing what
you're accusing others of doing.

You are also out of your depth if you thought filesystem caches
were the driver for large main memory and yet are making wise
pronouncements about the future at the same time. You might
want to listen more, it looks like there are things you could learn.

-george william herbert
gher...@crl.com

Andreas Kaiser

Feb 22, 2000, 3:00:00 AM

On Sat, 19 Feb 2000 05:15:54 -0800, DONT.qed...@pobox.com (Paul
Hsieh) wrote:

>Yes, but they showed an associated microcode ROM plus widget that is
>associated to the trace cache. I suspect that what happens is that they
>have selected a size that works for "most" x86=>micro-op translations for
>the trace cache and for sizes that are too large, or otherwise degraded

Microcode doesn't help when you have to encode 32-bit
immediates/displacements. Also, it does not reduce the number of trace
cache slots needed for simple load-exec-store ops, because microcode
needs hardware-decoder-generated housekeeping ops preceding the actual
microcode invocation (load immediate value to scratch reg, load
effective address to scratch reg).

>they simply set a flag and a point into the microcode ROM. The software
>writers guide seemed to be telling programmers to use x86 opcode that
>have limited immediate operands which indicates that the trace cache may
>have (aggregate) micro-op size limitations, which supports my theory

Discourage "cmp/mov [reg+disp32],imm32"? Discourage even "jump/call
disp32"? Even Intel still has to live with existing code for a while.

Assuming that otherwise a W. microop is roughly analogous to a P6
microop and excess microops are needed to encode 32-bit values and
displacements, a limit of 3 such microops per clock cycle makes me
question what the trace cache is really good for.

I am sure that a trace cache of similar die size maps only a small
fraction of a K7-like I-cache+predecode. So for complex code without
dominating small loops, W.'s performance could get interesting.
Especially if it has just one single x86 decoder while the K7 can even
predecode two instructions in parallel in simple cases.

> I.e., they could use some sort of RISC-like opcodes with flags for x86 instruction
>boundaries (so it can determine the instruction abort boundaries.)

The boundaries are easy to encode. But it is not easy to encode the
associated program addresses (this/next instruction PC value). It's
only easy for Crusoe because of its commit/rollback support.

However it should be possible to use a similar approach for hardware
x86 implementations - rollback to a known state (e.g. start of trace
cache sequence) and re-run the offending code in a special "one x86
instruction at a time" mode.

Andreas Kaiser

Feb 22, 2000, 3:00:00 AM

On Sun, 20 Feb 2000 13:06:06 -0800, "Leon Trotsky" <tro...@ussr.ru>
wrote:

>Note the X86 one-byte opcode space was filled long ago.
>Imagine, storing a 64 bit pointer into 64 bit addressable memory,
>the counterpart of MOV [EDI],EAX, will take 19 bytes on Jackhammer!
>
>2 (two byte opcode) + 1 (modrm) + 8 (disp) + 8 (imm) = 19 bytes

64-bit addressing does not necessarily imply that immediate values and
displacements are encoded in 64 bits. At MPF99, AMD said that, with
the exception of "mov reg64,imm64", Sledgehammer encodes displacements
and immediates in 32 bits, sign-extended.
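
The arithmetic behind the two encodings can be checked directly. This is only a sketch: the field sizes are taken from the quoted sum above, prefixes and SIB bytes are ignored, and `encoded_length` and `sign_extend_32_to_64` are names invented here for illustration, not anything from AMD's documentation.

```python
def encoded_length(opcode_bytes, modrm_bytes, disp_bytes, imm_bytes):
    """Sum the byte fields of one instruction encoding (prefixes/SIB ignored)."""
    return opcode_bytes + modrm_bytes + disp_bytes + imm_bytes


def sign_extend_32_to_64(value):
    """Treat the low 32 bits of value as signed and widen them to 64 bits."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:          # bit 31 set: the 32-bit value is negative
        value -= 1 << 32
    return value & 0xFFFFFFFFFFFFFFFF


# Quoted worst case: 64-bit displacement plus 64-bit immediate.
full_64 = encoded_length(2, 1, 8, 8)
# Same fields with 32-bit displacement and immediate, sign-extended at run time.
sign_ext_32 = encoded_length(2, 1, 4, 4)

print(full_64, sign_ext_32)
print(hex(sign_extend_32_to_64(0xFFFFFFFF)))
```

The price of the shorter form is that only values representable as a signed 32-bit quantity can be encoded this way; anything else needs the one full-width move.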

>AMD unwisely overlooked an existing IA32 feature that already supports
>64 bit addressing, the Physical Address Extension (PAE).

A large database machine may want to be able to use more than 4GB of
shared buffer cache by each database process, which implies a
virtual/linear address space in excess of 4GB. PAE, however, only allows
more than 4GB of physical memory; each process's linear memory is still
limited to 4GB.
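
To put numbers on the distinction, here is a minimal sketch. It assumes the 36-bit physical addresses of P6-class PAE and a 32-bit linear address space; the helper names and the 6GB cache size are made up for the example.

```python
GIB = 1 << 30


def address_space_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given width."""
    return 1 << address_bits


linear_bytes = address_space_bytes(32)     # what one process can map at once
physical_bytes = address_space_bytes(36)   # what the machine can hold with PAE


def fits_in_one_process(cache_bytes):
    """True if a shared buffer cache could be mapped whole into one process."""
    return cache_bytes <= linear_bytes


print(linear_bytes // GIB, physical_bytes // GIB)   # GB per process, GB per machine
print(fits_in_one_process(6 * GIB))
```

So the machine can hold the 6GB cache, but no single process can see all of it at once without remapping.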

The K7 does support 1 of those 2 physical address extensions Intel

Peter da Silva

Feb 22, 2000, 3:00:00 AM

In article <38b1f9f4$0$2...@nntp1.ba.best.com>,
Leon Trotsky <tro...@ussr.ru> wrote:
> AFAIK the overwhelming reason why >4GB of real memory is needed on X86
> is to contain a gigantic disk cache of a X86 file server.

You forget large address space Oracle databases, as currently implemented on
Alpha processors.

A 32-bit version of QEMM.SYS is thus less desirable than a large flat address
space.

--
In hoc signo hack, Peter da Silva <pe...@baileynm.com>
`-_-' Ar rug tú barróg ar do mhactíre inniu?
'U`
"I *am* $PHB" -- Skud.

Aaron Spink

Feb 22, 2000, 3:00:00 AM

"Leon Trotsky" <tro...@ussr.ru> writes:

> AFAIK the overwhelming reason why >4GB of real memory is needed on X86
> is to contain a gigantic disk cache of a X86 file server.
>

Um, no. The overwhelming reason is to be able to support DB caches
that are greater than 4GB or 3GB in most x86 operating systems. DB
caches need to be shared space between a large number of DB threads or
processes. The ability to support in-memory DBs for a variety
of workloads is also becoming important.

We are nearing the time when a mid-range server will have 32-64GB of

Brian Jonathan Lee

Feb 22, 2000, 3:00:00 AM

In article <38AC4C5B...@bellatlantic.net>,
Jeffrey S. Dutky <du...@bellatlantic.net> wrote:
>how about:
>
>byte = 1 byte (8-bits)
>half-word = 2 bytes (16-bits)
>word = 4 bytes (32-bits)
>long-word = 8 bytes (64-bits)
>deca-word = 10 bytes (80-bits)
>quad-word = 16 bytes (128-bits)
>octa-word = 32 bytes (256-bits)

This one makes the most sense! At work we actually say "d-word" (double
word) for 8 bytes, though. It gets me so confused when someone says "hey look!
it's a 4 quad word transfer!"

bjl

Terje Mathisen

Feb 22, 2000, 3:00:00 AM

Leon Trotsky wrote:
> You don't know much about the X86 architecture, if anything.
> The time required to execute MOV CR3 or MOV/MOV/INVLPG
> (to change virtual space) is negligible on today's 700 MHz P6 or Athlon.

Leon, please grow up a bit.

I lived through the years of Expanded Memory Managers.

At that time we had dedicated hw for page flipping, and no cache
problems at all; this was still so much slower than a simple linear
address that it _really_ did hurt.

Having to use OS-level interfaces to remap memory in the middle of a
tight loop which searches a huge hash table or B-tree is not something
I'd like to contemplate.

Terje

PS. Your .sig picture was impressive the first time you used it, but now
you're just wasting bandwidth and screen real estate.

H.W. Stockman

Feb 22, 2000, 3:00:00 AM

"Leon Trotsky" <tro...@ussr.ru> wrote in message

news:38b230fb$0$2...@nntp1.ba.best.com...

> The time required to execute MOV CR3 or MOV/MOV/INVLPG
> (to change virtual space) is negligible on today's 700 MHz P6 or Athlon.

It isn't so much the time spent in moves and address creation, as all the
time spent asking "where am I?" and the logic required to generate the
address. If arrays and databases were traversed always in an orderly
fashion, it wouldn't be a big deal. But they aren't.
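
A rough way to see the difference between orderly and disorderly traversal is to count how often a single mapping window would have to move. The sketch below is a toy model, not a measurement: the 16GB data set, the 4GB window, and the `count_remaps` helper are all assumptions chosen for illustration.

```python
import random

GIB = 1 << 30


def count_remaps(offsets, window_size):
    """Count how often one mapping window must be moved to reach each offset."""
    remaps = 0
    current_window = None
    for offset in offsets:
        window = offset // window_size
        if window != current_window:    # target lies outside the mapped window
            remaps += 1
            current_window = window
    return remaps


total = 16 * GIB        # hypothetical data set size
window = 4 * GIB        # hypothetical size of the mapped window
accesses = 10_000

# Orderly traversal: evenly spaced offsets visited front to back.
sequential = range(0, total, total // accesses)

# Disorderly traversal: uniformly random offsets, fixed seed for repeatability.
rng = random.Random(0)
scattered = [rng.randrange(total) for _ in range(accesses)]

sequential_remaps = count_remaps(sequential, window)
scattered_remaps = count_remaps(scattered, window)
print(sequential_remaps, scattered_remaps)
```

In this model the in-order walk moves the window only when it crosses a window boundary, while the scattered walk moves it on most accesses, and each move also carries the "where am I?" bookkeeping described above.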
[...]

>> Um, no. The overwhelming reason is to be able to support DB caches

[...]

> You don't know much about the X86 architecture, if anything.

??? this thread seems needlessly bellicose...

Dave Cherkus

Feb 22, 2000, 3:00:00 AM

Leon Trotsky (tro...@ussr.ru) wrote:
: Analysts expect by 2002 or 2003...

And analysts said in 1988 I should forget TCP/IP and learn OSI (at
least Marshall Rose made a lot of $$$ selling his OSI books) and said
that 1994 was the year ISDN would replace analog telephony and 1996 was
when NT would finally kill off UNIX. So, what were you saying again?

--
Dave Cherkus ------- UniMaster, Inc. ------ Contract Software Development
Specialties: UNIX Internals/Kernel TCP/IP Alpha Clusters Performance ISDN
Email: che...@UniMaster.COM ---------------- Is there life before death?

Carlo Razzeto

Feb 22, 2000, 3:00:00 AM

Windows 2000 Datacenter can use 64GB of memory... and 32 processors...

Carlo Razzeto

"Leon Trotsky" <tro...@ussr.ru> wrote in message
news:38b230fb$0$2...@nntp1.ba.best.com...

> Aaron Spink <sp...@padc13.pa.dec.com> wrote in message
> news:sfqog99...@padc13.pa.dec.com...
> > "Leon Trotsky" <tro...@ussr.ru> writes:
> >
> > > AFAIK the overwhelming reason why >4GB of real memory is needed on X86
> > > is to contain a gigantic disk cache of a X86 file server.
> >
> > Um, no. The overwhelming reason is to be able to support DB caches
> > that are greater than 4GB or 3GB in most x86 operating systems.
>
> "most X86 operating systems"
>
> (ROL) as if Winblows 1898 could handle that amount of memory.
>
> > DB caches need to be shared space between a large number of DB threads or
> > processes. Also the ability to support in memory DB's for a variety
> > of workloads is also becoming important.
>
> As I explained, IA32's PAE can accommodate such tasks adequately,
> but perhaps not elegantly enough to satisfy an academic.
>
> > We are nearing the time when a mid-range server will have 32-64GB of
> > ram and high end servers will support upwards of TB's of physical
> > memory. You don't want to play TLB games to access that level of
> > memory.
> >
> > Aaron Spink
> > not speaking for Compaq
>
> You aren't speaking for Compaq, but you are >>SPAMMING<< for DEC.
>
> You don't know much about the X86 architecture, if anything.
> The time required to execute MOV CR3 or MOV/MOV/INVLPG
> (to change virtual space) is negligible on today's 700 MHz P6 or Athlon.
>
> Besides, why bother spamming? Alpha will be dead and gone, years from now,
> when a fileserver or database process really must access all 2^64 bytes.
> Analysts expect by 2002 or 2003, Compaq will be selling Itanium servers instead.

Paul Hsieh

Feb 22, 2000, 3:00:00 AM

Andreas Kaiser wrote:
> (Paul Hsieh) wrote:
> >Yes, but they showed an associated microcode ROM plus widget that is
> >associated to the trace cache. I suspect that what happens is that they
> >have selected a size that works for "most" x86=>micro-op translations for
> >the trace cache and for sizes that are too large, or otherwise degraded
>
> Microcode doesn't help when you have to encode 32bit
> immediates/displacements. Also it does not reduce the number of trace
> cache slots needed for simple load-exec-store ops because microcode
> needs hardware decoder generated housekeeping ops preceding the actual
> microcode invocation (load immediate value to scratch reg, load
> effective address to scratch reg).

Well, I have not read over the relevant patents yet, so I am just
guessing, but why not include a 32-bit displacement per micro-op? It's not
very space efficient, but given these sorts of problems it would seem as
though they would have to bite the bullet, and perhaps include at least a
single 32-bit offset per micro-op.

Perhaps a more space efficient way to do this is to include space for one
or two 32 bit offsets per "group" of micro-ops. Then there could be an
operand encoding for one of these immediates. This would certainly be in
line with their recommendation to reduce the use of immediates.
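
The space trade-off between the two layouts can be estimated with a little arithmetic. The sketch below is illustrative only: the 40-bit base micro-op width, the group size of 4, and the function names are assumptions, since the actual Willamette micro-op encoding is not known in this discussion.

```python
def bits_inline(num_uops, base_bits, imm_bits=32):
    """Every micro-op carries its own immediate/displacement field."""
    return num_uops * (base_bits + imm_bits)


def bits_shared(num_uops, base_bits, group_size, slots_per_group, imm_bits=32):
    """Each group of micro-ops shares a few immediate slots.

    Every micro-op carries a small selector instead of a full field:
    0 means "no immediate", 1..slots_per_group names a shared slot.
    """
    selector_bits = slots_per_group.bit_length()   # encodes 0..slots_per_group
    groups = -(-num_uops // group_size)            # ceiling division
    return (num_uops * (base_bits + selector_bits)
            + groups * slots_per_group * imm_bits)


# Hypothetical numbers: 8 micro-ops, 40-bit base encoding, groups of 4,
# one shared 32-bit slot per group.
inline_total = bits_inline(8, 40)
shared_total = bits_shared(8, 40, group_size=4, slots_per_group=1)
print(inline_total, shared_total)
```

The saving only holds while a group needs no more immediates than it has slots; code that is heavy on 32-bit immediates would overflow the slots and force groups to be split, which would fit with a recommendation to use fewer immediates.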

> >they simply set a flag and a point into the microcode ROM. The software
> >writers guide seemed to be telling programmers to use x86 opcode that
> >have limited immediate operands which indicates that the trace cache may
> >have (aggregate) micro-op size limitations, which supports my theory
>
> Discourage "cmp/mov [reg+disp32],imm32"? Discourage even "jump/call
> disp32"? Even Intel still has to live with existing code for a while.
>
> Assuming that otherwise a W. microop is roughly analogous to a P6
> microop and excess microops are needed to encode 32bit values and
> displacements, a limit of 3 such microps per clock cycle makes me
> question what the trace cache is really good for.

Well, if this is how the P6 does it, then Intel has already dealt with
"living with existing code for a while" by simply taking a penalty for
immediates and offsets anyway. I am somewhat suspicious of the theory
that the micro-ops are completely identical. In the P6 there is really
no need to space-optimize the encoding for micro-ops at all; however, in
Willamette, since they need to cache a lot of them, I suspect they would
need to make a reasonably efficient internal encoding.

> I am sure that a trace cache of similar die size maps only a small
> fraction of a K7-like I-cache+predecode.

Well, since the K7 doesn't use a trace cache (just a 72-entry instruction
control unit), they have no need to optimize the encoding size for their
RISC86 ops.

> [...] So for complex code without dominating small loops, W.'s
> performance could get interesting.
>
> Especially if it has just one single x86 decoder while the K7 can even
> predecode two instructions in parallel in simple cases.

Well, another thing we don't know is what the real-world penalty is for
ejecting things out of and bringing things into Willamette's trace cache.
The single decoder may be a minor consideration in comparison to the
speed of the on-chip L2 cache.

> >I.e., they could use some sort of RISC-like opcodes with flags for
> >x86 instruction boundaries (so it can determine the instruction abort
> >boundaries.)
>
> The boundaries are easy to encode. But it is not easy to encode the
> associated program addresses (this/next instruction PC value). It's
> only easy for Crusoe because of its commit/rollback support.
>
> However it should be possible to use a similar approach for hardware
> x86 implementations - rollback to a known state (e.g. start of trace
> cache sequence) and re-run the offending code in a special "one x86
> instruction at a time" mode.

Well, I suspect that this would decrease the effective use of the
micro-op stream. If special rollback micro-ops were encoded alongside each
speculation, I think that the only reasonable approach that could be used
to cover all cases of speculation would include check and rollback
micro-ops for every memory access, every loop, every divide, etc. Somehow I
don't think it would be worth it. There certainly was no mention of
check and rollback micro-ops at the presentation by Glenn Hinton; however,
that is not necessarily proof of anything.

I am sure that Transmeta's approach is more clever in that the Code
Morpher can identify the start of a basic block and do more "sweeping"
check/rollbacks to decrease the need to constantly be issuing them.

Robert Harley

Feb 22, 2000, 3:00:00 AM

"Carlo Razzeto" <craz...@hotmail.com> writes:
> Windows 2000 Data center can use 64GB of memory... And 32 processors...

^^^^^^^^^^^^^^^^^
Sure, and my car can use 32 wheels. Two to actually provide traction.
Two more to get dragged along by the first two. The rest as dead
weight in the trunk.

Bye,
Rob.

Greg Alexander

Feb 22, 2000, 3:00:00 AM

In article <38B23C39...@hda.hydro.com>, Terje Mathisen wrote:
>Leon Trotsky wrote:
>> You don't know much about the X86 architecture, if anything.
>> The time required to execute MOV CR3 or MOV/MOV/INVLPG
>> (to change virtual space) is negligible on today's 700 MHz P6 or Athlon.
>
>Leon, please grow up a bit.
>
>I lived through the years of Expanded Memory Managers.
>
>At that time we had dedicated hw for page flipping, and no cache
>problems at all, this was still so much slower than a simple linear
>address that it _really_ did hurt.
>
>Having to use OS-level interfaces to remap memory in the middle of a
>tight loop which searches a huge hash table or B-tree, is not something
>I'd like to contemplate.

Me too. :) I'm disgusted to see anyone even SUGGESTING limiting
individual programs to be unable to address all physical RAM in the same
address space. When I switched from DOS to Linux, I gave up paged RAM
interfaces and /I am not going back/. I dare you to try to sell a system
to programmers based on anything of the sort.

Carlo Razzeto

Feb 22, 2000, 3:00:00 AM

Watch the MS keynote; they demoed a nice Unisys mainframe with 16 processors
calculating 4.2 billion air routes a second for the cheapest fares.... I
doubt your home computer could do that...

Carlo Razzeto

"Robert Harley" <har...@corton.inria.fr> wrote in message
news:rz7itzh...@corton.inria.fr...

Leon Trotsky

Feb 22, 2000, 3:00:00 AM

Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message
news:38B23C39...@hda.hydro.com...

> Leon Trotsky wrote:
> > You don't know much about the X86 architecture, if anything.
> > The time required to execute MOV CR3 or MOV/MOV/INVLPG
> > (to change virtual space) is negligible on today's 700 MHz P6 or Athlon.
>
> Leon, please grow up a bit.

Unlike you, I'm not interested in flame wars.

> I lived through the years of Expanded Memory Managers.
>
> At that time we had dedicated hw for page flipping, and no cache
> problems at all, this was still so much slower than a simple linear
> address that it _really_ did hurt.

"flipping 64KB" in MSDOS isn't the same situation as "flipping 4GB"
on a UNIX fileserver.

> Having to use OS-level interfaces to remap memory in the middle of a
> tight loop which searches a huge hash table or B-tree, is not something
> I'l like to contemplate.

You don't know enough about X86 or kernel architectures to argue with me.

WRT a big X86 fileserver, the filesystem driver would run in kernel mode (CPL0),
so it wouldn't need to call an "OS-level interface": it's part of the OS kernel.

Besides, a big X86 fileserver probably would have multiple processors,
so subdividing a hash or B-tree (the only good point you raised)
into 4GB processes would allow the processors to search in parallel.
E.g., a 4x X86 fileserver could very efficiently manage a 16GB disk cache.

>PS. Your .sig picture was impressive the first time you used it, but now
>you're just wasting bandwidth and screen real estate.

Troll, you've been elected to my blocked list.

Sander Vesik

Feb 22, 2000, 3:00:00 AM

In comp.arch Leon Trotsky <tro...@ussr.ru> wrote:
> Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message
> news:38B23C39...@hda.hydro.com...
>> Leon Trotsky wrote:
>> > You don't know much about the X86 architecture, if anything.
>> > The time required to execute MOV CR3 or MOV/MOV/INVLPG
>> > (to change virtual space) is negligible on today's 700 MHz P6 or Athlon.
>>
>> Leon, please grow up a bit.

> Unlike you, I'm not interested in flame wars.

>> I lived through the years of Expanded Memory Managers.
>>
>> At that time we had dedicated hw for page flipping, and no cache
>> problems at all, this was still so much slower than a simple linear
>> address that it _really_ did hurt.

> "flipping 64KB" in MSDOS isn't the same situation as "flipping 4GB"
> on a UNIX fileserver.

No. It is much worse. Way much worser. Because all kinds of strange things
start to happen. Think of the TLB, for starters.

>> Having to use OS-level interfaces to remap memory in the middle of a
>> tight loop which searches a huge hash table or B-tree, is not something
>> I'l like to contemplate.

> You don't know enough about X86 or kernel architectures to argue with me.

He *DOES*.

> WRT to a big X86 fileserver, the filesystem driver would run in kernel mode (CPL0),
> so it wouldnt need to call a "OS-level interface": its part of the OS kernel.

It still needs to call an OS-level interface, even if it is itself part
of the OS kernel. It needs to call a big icky thing, and that sounds
really wrong inside a tight loop. And it wastes a lot of time.

> Besides, a big X86 fileserver probably would have multiple processors,
> so subdividing a hash or B-tree (the only good point you raised)
> into 4GB process-es, would allow the processors to search in parallel.
> Eg a 4x X86 fileserver could very efficiently manage a 16GB disk cache.

Care to present a workable solution for that in pseudocode? What happens
at the times the tree needs to be balanced / the hash shrinks/grows? Did
you give some consideration to the implementation of what you are proposing?
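
For what it is worth, the simplest reading of the proposal is a cache sharded by key, one shard per 4GB process. The sketch below is hypothetical (the `ShardedCache` class and the modulo placement rule are assumptions, not anything the poster specified); it mainly shows why growing or shrinking is awkward, since most entries end up owned by a different shard afterwards.

```python
def shard_of(block, num_shards):
    """Pick the shard (stand-in for one 4GB process) that owns a block number."""
    return block % num_shards


class ShardedCache:
    """Toy disk cache split across several independent shards."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, block, data):
        self.shards[shard_of(block, len(self.shards))][block] = data

    def get(self, block):
        return self.shards[shard_of(block, len(self.shards))].get(block)

    def resize(self, num_shards):
        """Rebuild with a new shard count; return how many entries changed shard."""
        old_shards = self.shards
        self.shards = [dict() for _ in range(num_shards)]
        moved = 0
        for index, shard in enumerate(old_shards):
            for block, data in shard.items():
                new_index = shard_of(block, num_shards)
                if new_index != index:
                    moved += 1          # entry must cross into another address space
                self.shards[new_index][block] = data
        return moved


cache = ShardedCache(4)
for block in range(1000):
    cache.put(block, block * 2)

moved = cache.resize(5)
print(moved, "of 1000 entries changed shard")
```

Lookups stay cheap as long as a request touches one shard, but every entry counted in `moved` stands for data that would have to be copied between separate address spaces, which is the rebalancing cost the question is pointing at.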

>>PS. Your .sig picture was impressive the first time you used it, but now
>>you're just wasting bandwidth and screen real estate.

> Troll, you've been elected to my blocked list.

Well, I might see how one could be led to believe some things you have
said, but not that Terje is trolling...

> //////\\
> ///_ _\\\
> _| _\ /_ |_
> |.|-(.)-(.)-.| Leon Trotsky
> \| J |/
> \ =###= /
> \ .--. /
> ",###,/
> "#"

--
Sander

There is no love, no good, no happiness and no future -
these are all just illusions.

Tim McCaffrey

Feb 22, 2000, 3:00:00 AM

In article <38b2dc12$0$2...@nntp1.ba.best.com>, tro...@ussr.ru says...
>
>Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message
>news:38B23C39...@hda.hydro.com...
>
>You don't know enough about X86 or kernel architectures to argue with me.

ROFL. Leon, Terje has forgotten more about such things than you will ever
know.

Tim McCaffrey

Terje Mathisen

Feb 22, 2000, 3:00:00 AM

Carlo Razzeto wrote:
>
> Watch the MS keynote; they demoed a nice Unisys mainframe with 16 processors
> calculating 4.2 billion air routes a second for the cheapest fares.... I
> doubt your home computer could do that...

Watch your step!

Robert Harley happens to play with high-grade '264 Alphas, it is very
conceivable that he has one or more machines capable of numbers in the
same ballpark, particularly if he's allowed to optimize the algorithm
first.
:-)

Terje

Andrew Reilly

Feb 22, 2000, 3:00:00 AM

On Tue, 22 Feb 2000 10:59:50 -0800, Leon Trotsky wrote:

(referring to a post by Terje Mathisen <Terje.M...@hda.hydro.com>)

>You don't know enough about X86 or kernel architectures to argue with me.

Now that's one of the funniest things I've read in this group
for a long time. Thanks for the morning's entertainment...

--
Andrew

Lawrence Kirby

Feb 22, 2000, 3:00:00 AM

In article <88ujfd$1eup$1...@walter.acs.nmu.edu>
craz...@hotmail.com "Carlo Razzeto" writes:

>Watch the MS keynote; they demoed a nice Unisys mainframe with 16 processors
>calculating 4.2 billion air routes a second for the cheapest fares.... I
>doubt your home computer could do that...

Nor would I want it to. I only wanted to get a flight from London
to Manchester; trust Microsoft to turn it into a computing nightmare.
:-)

--
-----------------------------------------
Lawrence Kirby | fr...@genesis.demon.co.uk
Wilts, England | 7073...@compuserve.com
-----------------------------------------

Andreas Kaiser

Feb 22, 2000, 3:00:00 AM

Some time ago I regularly used the OS/2 port of GCC (EMX) which used
the ancient a.out format (which doesn't have a const section) and
therefore it did not refrain from putting switch tables and string
constants in the code section, right in the middle of the code. Of
course this resulted in some performance penalty on all CPUs having
split I/D caches, but it was limited to code/data conflicts within a
fairly small cache line.

However Willie is different. IMHO a trace cache cannot easily detect
self modifying code. So W. instead seems to use the TLB for this
purpose: A page cannot be located in both I- and D-TLB (Intel: "Do not
put code and data on the same 4K page."). Maybe the whole trace cache
has to be flushed if a store hits an I-TLB entry. I don't know how
many legacy programs are affected, but those which are (plain old DOS?
Win16? JIT compiler?) could end up being faster on the old Pentium.

Gruss, Andreas

Andreas Kaiser

Feb 22, 2000, 3:00:00 AM

On Tue, 22 Feb 2000 08:25:29 -0800, DONT.qed...@pobox.com (Paul
Hsieh) wrote:

>> However it should be possible to use a similar approach for hardware
>> x86 implementations - rollback to a known state (e.g. start of trace
>> cache sequence) and re-run the offending code in a special "one x86
>> instruction at a time" mode.
>
>Well, I suspect that this would decrease the effective use of the micro-
>op stream. If special roll back micro-ops were encoded along side each
>speculation,

A trace cache sequence does not necessarily cover just a single
speculation. IMHO it would be pretty useless if it did.

The trace cache is not indexed by program address except for the first
in a sequence of trace cache bundles (after a branch misprediction or
when the prior sequence terminated due to sequence length
restriction). Further bundles use the next cache row with way index
chaining (see Intel patent).

It would be fairly natural to use this start of a sequence to
checkpoint the microop stream. It doesn't matter how many speculative
ops are further down, as long as all of them can be undone. So if
retirement can be deferred until the whole sequence is done, the
mentioned commit/rollback is rather easy (roughly comparable to
misprediction/exception treatment). It results in some kind of
in-order backend queue and an upper limit to the length of a sequence.
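
The commit/rollback idea described above can be modelled in a few lines. This is a toy software model under stated assumptions, not Intel's or Transmeta's mechanism: machine state is a dict, each micro-op is a Python function, and `Fault`, `run_sequence`, `add`, and `div` are names invented for the sketch.

```python
class Fault(Exception):
    """Raised by a micro-op that cannot complete (e.g. a divide error)."""


def add(reg, amount):
    def op(state):
        state[reg] += amount
    return op


def div(reg, divisor_reg):
    def op(state):
        if state[divisor_reg] == 0:
            raise Fault("divide error")
        state[reg] //= state[divisor_reg]
    return op


def run_sequence(state, ops):
    """Run ops as one unit: commit all results or none.

    Returns (committed_state, index_of_faulting_op_or_None).
    The state passed in is the checkpoint and is never modified.
    """
    working = dict(state)               # speculative copy; checkpoint untouched
    try:
        for op in ops:
            op(working)
        return working, None            # whole sequence retires at once
    except Fault:
        pass                            # discard the speculative copy (rollback)

    # Slow path: replay from the checkpoint one op at a time,
    # committing after each, until the offending op is identified.
    committed = dict(state)
    for index, op in enumerate(ops):
        trial = dict(committed)
        try:
            op(trial)
        except Fault:
            return committed, index     # precise state just before the fault
        committed = trial
    return committed, None


checkpoint = {"a": 10, "b": 0}
result, fault_at = run_sequence(checkpoint, [add("a", 5), div("a", "b"), add("a", 1)])
print(result, fault_at)
```

The slow path is only taken when something goes wrong, which is the point of the scheme: the common case pays for one checkpoint per sequence rather than bookkeeping per operation.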

It doesn't really matter where you place the checkpoint, except that
it must be at an x86 instruction boundary. IMHO basic blocks lose some
significance with trace caches.

Gruss, Andreas

Paul DeMone

Feb 22, 2000, 3:00:00 AM

Leon Trotsky wrote:
[...]

> Besides, why bother spamming? Alpha will be dead and gone, years from now,
> when a fileserver or database process really must access all 2^64 bytes.
> Analysts expect by 2002 or 2003, Compaq will be selling Itanium servers instead.

If I had a dollar for every time I heard Alpha was doomed (and mostly
from more authoritative sources than this) I could afford one.

--
Paul W. DeMone The 801 experiment SPARCed an ARMs race of EPIC
Kanata, Ontario proportions to put more PRECISION and POWER into
dem...@mosaid.com architectures with MIPSed results but ALPHA's well
pde...@igs.net that ends well.

Bill Todd

Feb 22, 2000, 3:00:00 AM

Am I the only person who finds that number just a bit unbelievable? I mean,
*I* personally can find the cheapest route from 4.2 billion routes in a
second if someone has been kind enough to sort them in cost order first, but
the idea that each processor can somehow process information for a route in
4 ns. (the performance implicit in the 4.2 billion figure - since for sorted
data, the number is essentially irrelevant) seems a bit of a stretch, given
that it must not only perform a price comparison but first must check that
the route is the desired one.
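
The implied per-route time is easy to reproduce. A minimal sketch, assuming the work divides evenly over the 16 processors; `ns_per_item` is a name made up for the example.

```python
def ns_per_item(items_per_second, processors):
    """Nanoseconds each processor may spend per item at the claimed rate."""
    per_processor_rate = items_per_second / processors
    return 1e9 / per_processor_rate


claimed_ns = ns_per_item(4.2e9, 16)
print(round(claimed_ns, 1), "ns per route per processor")
```

That works out to a little under 4 ns per route, which at clock rates of a few hundred MHz is on the order of one or two cycles for both the route check and the price comparison.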

- bill

Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message
news:38B2EE...@hda.hydro.com...

> Carlo Razzeto wrote:
> >
> > Watch the MS keynote; they demoed a nice Unisys mainframe with 16 processors
> > calculating 4.2 billion air routes a second for the cheapest fares.... I
> > doubt your home computer could do that...
>

Leon Trotsky

Feb 22, 2000, 3:00:00 AM

Sander Vesik <san...@haldjas.folklore.ee> wrote in message
> In comp.arch Leon Trotsky <tro...@ussr.ru> wrote:
> > Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message
> > news:38B23C39...@hda.hydro.com...
> >> Leon Trotsky wrote:
> > "flipping 64KB" in MSDOS isn't the same situation as "flipping 4GB"
> > on a UNIX fileserver.
>
> No. It is much worse. Way much worser.

I have heard EMS page flipping was too frequent on MSDOG.

> >> Having to use OS-level interfaces to remap memory in the middle of a
> >> tight loop which searches a huge hash table or B-tree, is not something
> >> I'd like to contemplate.
>
> > You don't know enough about X86 or kernel architectures to argue with me.
>
> He *DOES*.

Terje seems to be merely a dabbler in X86 microprocessors.

> > WRT to a big X86 fileserver, the filesystem driver would run in kernel mode (CPL0),
> > so it wouldnt need to call a "OS-level interface": its part of the OS kernel.
>
> It still needs to call OS level interface, even if it itself is part
> of the OS kernel.

Ever heard of inline asm code?

> It needs to call a big icky thing and that sounds

"big icky thing": you don't know what you are talking about.

> really wrong inside a tight loop. And wastes a lot of time.

No, what you think of as a tight loop would be a loop that
scans the 4GB virtual space. The code to change virtual spaces
would be somewhere in an outer outer loop, hardly ever executed,
by the inherent nature of a disk cache.

> Well, I might see how one could be lead to believe some things you have
> said, but not that Terje is trolling...

Who are you to speak for Terje's motives? Are you Terje?

The flame bait Terje threw into this technical discussion was obvious,
and disappointing.

Carlo Razzeto

Feb 22, 2000, 3:00:00 AM

You should, I made a mistake.... per day... Sorry about that... But still...
The point is MS is making an initially impressive segue into the mainframe
market.... Now I know it may seem sacrilegious to think that a company like
Microsoft could break into such a space, but you should keep your minds
open... Not only UNIX can run a mainframe...

Carlo Razzeto

"Bill Todd" <bill...@foo.mv.com> wrote in message
news:88v6bi$efb$1...@pyrite.mv.net...

> Am I the only person who finds that number just a bit unbelievable? I mean,
> *I* personally can find the cheapest route from 4.2 billion routes in a
> second if someone has been kind enough to sort them in cost order first, but
> the idea that each processor can somehow process information for a route in
> 4 ns. (the performance implicit in the 4.2 billion figure - since for sorted
> data, the number is essentially irrelevant) seems a bit of a stretch, given
> that it must not only perform a price comparison but first must check that
> the route is the desired one.
>
> - bill
>
> Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message

Leon Trotsky

Feb 22, 2000, 3:00:00 AM

Andreas Kaiser <a...@s.netic.de> wrote in message news:Yw6zOKGtk0kE71...@4ax.com...

>Some time ago I regularly used the OS/2 port of GCC (EMX) which used
>the ancient a.out format (which doesn't have a const section) and
>therefore it did not refrain from putting switch tables and string
>constants in the code section, right in the middle of the code. Of
>course this resulted in some performance penalty on all CPUs having
>split I/D caches, but it was limited to code/data conflicts within a
>fairly small cache line.

There should be no functional/performance issues as long as the
string constant is only read.

But, if an attempt is made to modify the string in EMX's memory model
as you describe, then a page fault will occur in a code page.

> However Willie is different. IMHO a trace cache cannot easily detect
> self modifying code.
>
> So W. instead seems to use the TLB for this
> purpose: A page cannot be located in both I- and D-TLB (Intel: "Do not
> put code and data on the same 4K page."). Maybe the whole trace cache
> has to be flushed if a store hits an I-TLB entry.

What I-TLB entry if the X86 has paging disabled?
What I-TLB entry if MOV CR3 was previously executed?

Because X86 caches have always been physically addressed,
I would guess that the trace cache entries are tagged with translated
physical addresses that point within previously executed code segments.

emanuel stiebler

Feb 22, 2000, 3:00:00 AM

----- Original Message -----
From: Carlo Razzeto <craz...@hotmail.com>
Newsgroups: comp.arch,comp.sys.intel
Sent: Tuesday, February 22, 2000 18:05
Subject: Re: 64 bit X86 ugliness (Re: Williamette trace cache (Re: First
view of Willamette))

> You should, I made a mistake.... per day... Sorry about that... But still...
> Point is MS is making an initially impressive segue into the mainframe
> market....

I give them as much credit for this demo as I gave them the last time,
when they tried to attach a scanner to win98. ;-)

> Now I know it may seem sacrilegious to think that a company like
> microsoft could break into such a space, but you should keep your minds
> open...

It is sacrilegious ;-)

> Not only UNIX can run a mainframe...

Sure. VMS, MVS, ... ;-)

> Carlo Razzeto

Cheers & have fun,
emanuel

Del Cecchi

Feb 22, 2000, 3:00:00 AM

That is OS/390 now. And VM and VSE. And I hear Linux runs on 390 or
something like that. :-) Where are mainframes running VMS?

>
> > Carlo Razzeto
>
> Cheers & have fun,
> emanuel

Del Cecchi

Carlo Razzeto

Feb 22, 2000, 3:00:00 AM

After a billion dollars of development, I want to see what they have.... I
think Windows could potentially be an alternative mainframe operating system,
though it is their first attempt to break into that market. I'm going to
wait and see what the reviews are when data center comes out... However,
apparently Unisys believes in it enough to make a product based on it, so
that says something to me...

Carlo Razzeto

"emanuel stiebler" <e...@ecubics.com> wrote in message
news:88vfr7$4kl$1...@ffx2nh3.news.uu.net...

> ----- Original Message -----
> From: Carlo Razzeto <craz...@hotmail.com>
> Newsgroups: comp.arch,comp.sys.intel
> Sent: Tuesday, February 22, 2000 18:05
> Subject: Re: 64 bit X86 ugliness (Re: Williamette trace cache (Re: First
> view of Willamette))
>
>
> > You should, I made a mistake.... perday... Sorry about that... But
> still...
> > Point is MS is making an initaly impressive segway into the MainFrame
> > market....
>
> I give them as much credit for this demo, as I gave them the last time as
> they tried to attach a scanner to win98. ;-)
>
> > Now I know it may seem sacraligious to think that a company like
> > microsoft could break into such a space, but you should keep your minds
> > open...
>
> It is sacraligious ;-)
>
> > Not only UNIX can run a mainframe...
>
> Sure. VMS, MVS, ... ;-)
>

Dennis O'Connor

Feb 22, 2000, 3:00:00 AM

"Paul DeMone" <pde...@igs.net> wrote in message
news:38B31635...@igs.net...

> If I had a dollar for every time I heard Alpha was doomed (and mostly
> from more authoritative sources than this) I could afford one.

Sure, but you'd make more money if you could get a dollar
for every time someone predicted the Imminent Death of USENET.
--
Dennis O'Connor dm...@primenet.com
Vanity Web Page http://www.primenet.com/~dmoc/
Follow-ups set.

Christopher Gomez

Feb 22, 2000, 3:00:00 AM

Ho-Seop Kim wrote:
>
> Well, it seems Iain's point is that it could take multiple clock cycles
> to do associative lookups over relatively large instruction window. You
> don't have to have a unified window - P6 and others have multiple
> reservation stations.

Where do you see that the P6 has multiple reservation stations?

According to the Intel Documentation - the P6 microarchitecture
issues micro-ops to a SINGLE 20 entry 5 port reservation station.
Or more specifically a central unified window.

Note: the ROB has 40 entries.

Incidentally three of the functional units (Simple FPU/Complex FPU/Complex IEU)
are on Port 0 (zero). Port 1 has Simple IEU and JEU, Port 2 has the Load Unit,
Port 3 has the Store Address Generation Unit, and Port 4 has the Store Data Unit.
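
The port assignment listed above can be written down as a small lookup table. This is only a sketch of the listing in this post; the names `P6_RS_PORTS` and `port_for` are made up for illustration and are not Intel terminology.

```python
# Port-to-unit layout of the P6 reservation station as listed in the post.
# Names are illustrative only.
P6_RS_PORTS = {
    0: ["Simple FPU", "Complex FPU", "Complex IEU"],
    1: ["Simple IEU", "JEU"],
    2: ["Load Unit"],
    3: ["Store Address Generation Unit"],
    4: ["Store Data Unit"],
}


def port_for(unit):
    """Return the issue port that serves the named functional unit."""
    for port, units in P6_RS_PORTS.items():
        if unit in units:
            return port
    raise KeyError(unit)


for port in sorted(P6_RS_PORTS):
    print(f"Port {port}: {', '.join(P6_RS_PORTS[port])}")
```

Two micro-ops that both need a unit on Port 0 compete for the same issue slot, which is the practical consequence of the layout.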

The PowerPC 604 has distributed Reservation stations and so does the Sparc64 V. I
am not sure about the G3 PowerPC 740~750 series or the G4 PowerPC series.

> BTW, to me the trace cache in Sparc64 V seems to be a way to get around
> the "complexity" issues rather than having more fetch bandwidth. It
> heavily decodes instructions to reduce # of register file ports and
> complexity of steering logic. It could be the case for Willamette too
> - along with x86 predecoding.

The trace cache in Sparc64 V is a way to decouple the front-end
(Fetch/Dependency Check/Register Port/Execution Slot Allocate) from the Execute pipeline.
A trace packet in the trace cache contains up to 8 instructions (2 basic blocks), the
intrapacket/interpacket dependency graph, register port mappings, execution slot mappings,
and operand forwarding paths.

This leaves only instruction decode (1 cycle), in-flight register mapping/rename (1 cycle), and dispatch
(also called insert into RS, where instructions and their operands or tags are sent to the reservation
stations); if the reservation station is empty, it is bypassed and the instruction is sent directly
to the execution unit (1 cycle). If the instruction is a simple RISC instruction, execution latency
is 1 cycle; the write-back stage then occurs (1 cycle), where results are written into the ROB (reorder buffer).
Results are now available to later instructions, which puts the performance-critical pipeline
at 4 cycles for single-cycle-latency RISC ops.

There are three more stages for retiring the instructions. The complete pipeline of the machine
is 7 cycles for fetch/trace packet creation, 4 cycles for rename/dispatch/execute, and 3 cycles for
retire for a total of 15 cycles (in the case of single cycle instructions).

The front end and the back end are decoupled from the OOO core! The total window size (the sum
of the distributed reservation stations) is 64 entries.

Source is:
Microprocessor Report, Volume 13, Number 15, November 15, 1999
Hal Makes Sparcs Fly - Sparc64V Employs Trace Cache and Superspeculation

Intel has recently disclosed more details at the Intel Developer Forum. Does anyone have
any info on Willamette's trace architecture? From the picture in this week's EE Times it seems
very similar to the Sparc64, except that the instruction fetcher is only one instruction wide
and the trace cache holds predecoded micro-ops!

> --------------------------------------------------------------------
> -- All opinions are my own and do not reflect those of my employer.
> --
> -- Chris Gomez
> -- Performance Architect UltraSparc 5
> -- Sun Microelectronics
> -- cag...@eng.sun.com
> --------------------------------------------------------------------

John Gardner

Feb 22, 2000, 3:00:00 AM

H.W. Stockman <hws...@wizard.com> wrote in message news:KzIs4.61212>

> "Leon Trotsky" <tro...@ussr.ru> wrote in message
>

> But most important, he [Terje] has been modest and tolerant, and has generally
> conducted himself with the discipline implied in his SIG. It is very, very
> hard to get him to say an unkind word about anyone.

I beg to disagree, H.W.

This Terje opened his post by immediately flaming the poor dead Commie.

As far as I can tell, Trotsky never provoked Terje, other than daring
to use a multi-line sig in Terje's newsgroups.

Terje Mathisen Terje.M...@hda.hydro.com wrote,

>
> Leon, please grow up a bit.

Flame those damned Commie bastards.

JG

Bruce Hoult

Feb 23, 2000, 3:00:00 AM

In article <38b2dc12$0$2...@nntp1.ba.best.com>, "Leon Trotsky"
<tro...@ussr.ru> wrote:

> Terje Mathisen <Terje.M...@hda.hydro.com> wrote

> > Having to use OS-level interfaces to remap memory in the middle of a
> > tight loop which searches a huge hash table or B-tree, is not something
> > I'l like to contemplate.
>
>
> You dont know enough about X86 or kernel architectures to argue with me.

ha ha ha ha ha ha ha ha ha ha

That's so funny.

Jeffrey S. Dutky

Feb 23, 2000, 3:00:00 AM

Brian Jonathan Lee wrote:
>
> In article <38AC4C5B...@bellatlantic.net>,
> Jeffrey S. Dutky <du...@bellatlantic.net> wrote:
> >how about:
> >
> >byte = 1 byte (8-bits)
> >half-word = 2 bytes (16-bits)
> >word = 4 bytes (32-bits)
> >long-word = 8 bytes (64-bits)
> >deca-word = 10 bytes (80-bits)
> >quad-word = 16 bytes (128-bits)
> >octa-word = 32 bytes (256-bits)
>
> this one makes the most sense! At work we actually say
> "d-word" (double word) for 8bytes though. get's me so
> confused when someone says "hey look! it's a 4 quad word
> transfer!"

looking back on what I wrote, I can't seem to find the
sense in it. The terms are not very descriptive and you
still need to agree on the definition of 'word' in order
to make much sense out of it.

While it agrees quite nicely with some of my own assumptions,
based largely on a misspent youth programming PDP-11 and
68K assembly, it isn't generally applicable.

I much prefer my second suggestion, which names data sizes
for the number of bytes in the element:

byte = 1 byte = 8-bits
dual-byte = 2 bytes = 16-bits
quad-byte = 4 bytes = 32-bits
octa-byte = 8 bytes = 64-bits
hexa-byte = 16 bytes = 128-bits

I'd also like to be able to use the terms dbyte, qbyte,
obyte and hbyte, pronounced dee-byte, queue-byte (not to
be confused with the biblical unit of information storage,
the qbit ;-), oh-byte, and aitch-byte.

granted, it doesn't scale well beyond 16 bytes, but that
should be enough for anybody <:-o

- Jeff Dutky

H.W. Stockman

Feb 23, 2000, 3:00:00 AM

"Leon Trotsky" <tro...@ussr.ru> wrote in message

news:38b331f1$0$2...@nntp1.ba.best.com...

> Terje seems to be merely a dabbler in X86 microprocessors.

OK, let's bring this back to reality and civility.

Terje sleuthed out many of the previously undocumented oddities of the
Pentium modes back in 1994, and he published a seminal article in BYTE (back
when BYTE was a noble mag). He was picked by Intel to find solutions to the
FDIV bug, and he wrote some of the fastest algorithms to obtain accurate
quotients from the buggy chips -- the completeness of the algorithms, with
the exact consideration of all the IEEE requirements, was a model for asm
programmers. He wrote one of the fastest Pentium FFT algorithms available
at the time. He rewrote various Pentium and alpha algorithms to improve the
scheduling beyond what we thought was possible. And the list goes on and
on.

But most important, he has been modest and tolerant, and has generally
conducted himself with the discipline implied in his SIG. It is very, very
hard to get him to say an unkind word about anyone. His low-key approach
belies his experience, and you aren't going to win popularity contests by
taking advantage of his good nature.

In short, it is probably a good time for you to relax, go for a walk, have a
few beers, and visit:

Doug Siebert

Feb 23, 2000, 3:00:00 AM

fr...@genesis.demon.co.uk (Lawrence Kirby) writes:

>In article <88ujfd$1eup$1...@walter.acs.nmu.edu>
> craz...@hotmail.com "Carlo Razzeto" writes:

>>Watch the MS Keynote, they demoed a nice Unisis MainFrame with 16proc's
>>calculating 4.2 Billion air routs a second for the cheapest fairs.... I
>>doubt your home computer could do that...

>Nor would I want it to. I only wanted to get a flight from London
>to Manchester, trust Microsoft to turn it into a computing nightmare.

If it works as well as their MapQuest site, it'd have to calculate
billions of routes per second to find a way to route you through Canada
on a 4200 mile "Local Roads" segment in a 200 mile trip! Those who have
tried to use MapQuest to get driving directions within the US will
understand this one :)

--
Douglas Siebert Director of Computing Facilities
douglas...@uiowa.edu Division of Mathematical Sciences, U of Iowa

I'm planning on being dead for most of the new millennium, how about you?

Anne & Lynn Wheeler

Feb 23, 2000, 3:00:00 AM

oh yes, and there was lots of fare tuning ... implementations that
had fares in database on both per flight segment basis as well as
origin/destination basis ... could see a million updates in a single
day to the fare database.

--
Anne & Lynn Wheeler | ly...@adcomsys.net, ly...@garlic.com
http://www.garlic.com/~lynn/ http://www.adcomsys.net/lynn/

Anne & Lynn Wheeler

Feb 23, 2000, 3:00:00 AM

last time I solved this problem (been seven years), there were about
480k commercial flight segments on the OAG master ... and about 4080
airports worldwide with commercial flights. The 480k or so commercial
flight segments combined into something like 650k possible "flights"
(a multihop flight number with four flight segments would have 3
destinations from the starting point, two destinations from the first
stop, and one destination from the 2nd stop: 3+2+1 flights). The
"worst" was a flight number with 16 flight segments that took off from
a regional airport and flew the rounds of some lesser-known places
... eventually ending back at the original airport.
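
The 3+2+1 count generalizes to any number of stops. The sketch below counts origin/destination pairs along a single flight number. The function name is illustrative, and it assumes every stop is a distinct airport (which the 16-segment round trip would violate). The 3+2+1 figure matches either the total for a three-segment flight number or the multi-segment pairs of a four-segment one, so the function returns both numbers rather than picking a reading.

```python
def od_pairs(segments):
    """Count origin/destination pairs along one flight number.

    A flight number with `segments` consecutive legs visits segments + 1
    stops, and every (earlier stop, later stop) pair is a possible "flight".
    Returns (total pairs, pairs that span two or more legs).
    Assumes all stops are distinct airports.
    """
    if segments < 1:
        raise ValueError("a flight number needs at least one segment")
    stops = segments + 1
    total = stops * (stops - 1) // 2
    multi_leg = total - segments  # drop the single-leg pairs
    return total, multi_leg


for n in (3, 4, 16):
    total, multi_leg = od_pairs(n)
    print(f"{n} segments: {total} pairs in total, {multi_leg} spanning 2+ legs")
```

The count grows quadratically with the number of legs, which is why a handful of long multihop flight numbers contribute a disproportionate share of the combined "flights".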

Another pet peeve of mine has been flight numbers that had
intermediate stops with "change of equipment" but not a connection
(this had certain advantages because the normal page &/or screen
listing shows "directs" before connects). A plane could take off from
someplace with at least two flight numbers ... at some intermediate
airport one of the flight numbers would be magically associated with a
different plane ... which involved changing "equipment" ... but it wasn't
listed as a connection.

connections were where things started getting tricky for routes (got
into a lot larger number of permutations). database implementations
giving all possible routes for all possible from/dest pairs tended to
handle directs and one connects ... but started to break down as soon
as you tried to do two connects.

then comes fares ... there can be a dozen different fares for flight
segments as well as at least a dozen different fares between end-points

... and so various flight selection criteria can be (once all
possible routes have been determined between the origin/destination)

* "least expensive" ... based on either the end-to-end fare and/or the
aggregation of possible segment fares (with some of the fares having
varying seat availability on per segment basis or day of the week
basis)

* least travel time

* arrival before specific time

* departure after specific time

* maximum travel points (longest distance travelled)

Zalman Stern

Feb 23, 2000, 3:00:00 AM

In comp.arch Greg Alexander <gale...@sietch.bloomington.in.us> wrote:
: Me too. :) I'm disgusted to see anyone even SUGGESTING limiting
: individual programs to be unable to address all physical RAM in the same
: address space.

The forces of evil are strong and persistent. We must remain ever vigilant
in combating their multifarious attempts to push us backwards into the
burning pits of hell.

I guess if it weren't for being able to repeat the mistakes of the past,
computer architecture long ago would have reached optimality and become a
dead discipline.

-Z-

Terje Mathisen

Feb 23, 2000, 3:00:00 AM

Leon Trotsky wrote:
> Troll, you've been elected to my blocked list.

That's bad, because it robs me of the chance to use one of my all-time
favourite quotes:

"I couldn't engage in a battle of the mind with you, 'cause it wouldn't
be fair to do so against an unarmed opponent."

(Or at least approximately that. Anyone have the correct version, and
original author? Oscar Wilde?)

Terje

--
- <Terje.M...@hda.hydro.com>

Using self-discipline, see http://www.eiffel.com/discipline

Paul Hsieh

Feb 23, 2000, 3:00:00 AM

In article <jw6zOJBmS9BKOs...@4ax.com>, a...@s.netic.de says...

> On Tue, 22 Feb 2000 08:25:29 -0800, DONT.qed...@pobox.com (Paul
> Hsieh) wrote:
>
> >> However it should be possible to use a similar approach for hardware
> >> x86 implementations - rollback to a known state (e.g. start of trace
> >> cache sequence) and re-run the offending code in a special "one x86
> >> instruction at a time" mode.
> >
> >Well, I suspect that this would decrease the effective use of the micro-
> >op stream. If special roll back micro-ops were encoded along side each
> >speculation,
>
> A trace cache sequence does not necessarily cover just a single
> speculation. IMHO it would be pretty useless if it did.
>
> The trace cache is not indexed by program address except for the first
> in a sequence of trace cache bundles (after a branch misprediction or
> when the prior sequence terminated due to sequence length
> restriction). Further bundles use the next cache row with way index
> chaining (see Intel patent).
>
> It would be fairly natural to use this start of a sequence to
> checkpoint the microop stream. It doesn't matter how much speculative
> ops are in further down as long as all of them can be undone.

I think stores are very hard to undo if they leave the chip. They
claimed up to 24 stores in flight, so that sets some kind of bound on
checkpointing.

The problem is that any load or store may need speculative undo (since
they may fault) as well as divides, FP operations and so on. So
speculation is not that uncommon.

If they had sparse rollback and check points and therefore had to undo
many, many ops on any fault as you suggest, then they will likely have
lost an unacceptably large amount of work. Putting the machine into a
"single execution" mode to recover to the point of failure, also doesn't
sound like a great solution since the pipeline is about 16 clocks past
the trace cache -- hence a sequence of just 7 ops would take more than
100 clocks just to retrace back to the point where the exception really
occurred, after it did whatever other work it needed to do to the badly
speculated instructions. As you know, there are some branches that just
can't be well predicted, and these loops crop up in real world code. So
having some sort of solution for not-outrageous branch misprediction
recovery is pretty much a must.

The scheme you suggest also makes handling other sorts of exceptions like
device interrupts, a different special case (since there is no "exception
point" to find.) I think that something as simple as knowing the EIP for
each micro-op's associated x86 instruction (via calculation or whatever --
it really only takes 4 bits per instruction to communicate the length --
actually it could probably even be 3 bits if they encoded instructions
over 8 bytes as two micro-ops, which would be consistent with their P6
core) opens up the possibility for a much simpler and probably faster
solution.
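
The bit counts can be checked, and the "via calculation" idea sketched, in a few lines. This is a minimal illustration assuming straight-line code and a known starting EIP; the function names are hypothetical and nothing here describes how Willamette actually tracks instruction addresses.

```python
MAX_X86_INSN_LEN = 15  # architectural limit on x86 instruction length, in bytes


def bits_needed(max_value):
    """Number of bits needed to store any value in 0..max_value."""
    return max(1, max_value.bit_length())


def instruction_eips(start_eip, lengths):
    """Return the EIP of each instruction in a straight-line run.

    `lengths` holds the byte length of each x86 instruction in program
    order. A taken branch would restart the running sum at its target;
    that case is not modelled here.
    """
    eips = []
    eip = start_eip
    for n in lengths:
        if not 1 <= n <= MAX_X86_INSN_LEN:
            raise ValueError(f"invalid x86 instruction length: {n}")
        eips.append(eip)
        eip += n
    return eips


print(bits_needed(MAX_X86_INSN_LEN))  # lengths 1..15 stored directly
print(bits_needed(8 - 1))             # lengths 1..8 stored as length - 1
print([hex(e) for e in instruction_eips(0x1000, [2, 5, 1, 6])])
```

With the length carried alongside each instruction, the address of a faulting instruction is a running sum from the start of the run, so no separate per-micro-op EIP field is needed in this model.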

From the point of view of the back end of the machine, I imagine that
upon any exception, the portion of the scheduler past the exception point
is simply marked as "incorrect" (its outputs and further exceptions are
ignored), and the trace cache is given an exception signal (along with the
offending EIP, or otherwise uniquely indexing internal "PC") -- then
results are re-accepted after the point where a "resumption micro-op" is
delivered from the trace cache. This "incorrect flag" and "resumption
micro-op" might be more easily implemented via an "age value" which is
simply a synchronized pair of low bit counters.
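
One way to read the "age value" idea is sketched below. This is a toy model of a guess, not a description of real hardware: the counter width, the class names, and the rule that both counters advance together on an exception are all assumptions made for illustration.

```python
AGE_BITS = 2                      # assumed width; the post only says "low bit"
AGE_MASK = (1 << AGE_BITS) - 1


class AgeCounter:
    """A small wrap-around counter."""

    def __init__(self):
        self.value = 0

    def bump(self):
        self.value = (self.value + 1) & AGE_MASK


class Machine:
    """Front end tags micro-ops with an age; back end drops stale ones."""

    def __init__(self):
        self.front_age = AgeCounter()  # kept by the trace cache side
        self.back_age = AgeCounter()   # kept by the scheduler side
        self.accepted = []

    def issue(self, name):
        return (self.front_age.value, name)

    def exception(self):
        # Both counters advance together, so anything issued before the
        # exception now carries a stale age.
        self.front_age.bump()
        self.back_age.bump()

    def complete(self, uop):
        age, name = uop
        if age != self.back_age.value:
            return False               # stale result, ignored
        self.accepted.append(name)
        return True


m = Machine()
a, b, c = m.issue("a"), m.issue("b"), m.issue("c")
m.complete(a)
m.exception()                  # "b" faulted
m.complete(c)                  # issued before the fault, so dropped
m.complete(m.issue("resume"))  # first micro-op after resumption
print(m.accepted)
```

Because the counter wraps, a model like this only works if all stale micro-ops drain before the age value comes around again.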

> [...] So if retirement can be deferred until the whole sequence is
> done, the mentioned commit/rollback is rather easy (roughly comparable
> to misprediction/exception treatment). It results in some kind of
> in-order backend queue and an upper limit to the length of a sequence.
>
> It doesn't really matter where you place the checkpoint, except that
> it must be at a x86 instruction boundary.

Well if anything, branch targets and store over-flow points seem like
logical choices if such an implementation were chosen (in Transmeta's
case I think that branch targets and load-store re-ordering points would
be when they would do it.)

> [...] IMHO basic blocks loose some significance with trace caches.

Well the branch taken bubble from the P6 disappears (i.e., branch-verify
micro-ops can probably be sent down the middle of a micro-op sequence
with no trouble) if that's what you mean, so all code can, in some sense,
be considered auto-inlined and auto-unrolled to at least a limited
degree.

Just to be clear about this all though -- this is just my gut feel on
this. Given that Willamette is so new, I really couldn't say anything
too deep about it for sure, as I think nobody can except people from
Intel or who otherwise have an NDA with them.

--
Paul Hsieh
http://www.pobox.com/~qed/cpujihad.shtml

Jan Vorbrueggen

Feb 23, 2000, 3:00:00 AM

Zalman Stern <zal...@netcom2.netcom.com> writes:

> I guess if it weren't for being able to repeat the mistakes of the past,
> computer architecture long ago would have reached optimality and become a
> dead discipline.

I don't think so. An evolutionary optimum is contingent on the fitness
function - and thus the environment of the species - being constant in
time, or nearly so. This is definitely not the case in computer architecture.

Jan

Paul Hsieh

Feb 23, 2000, 3:00:00 AM

In article <Yw6zOKGtk0kE71...@4ax.com>, a...@s.netic.de says...

> Some time ago I regularly used the OS/2 port of GCC (EMX) which used
> the ancient a.out format (which doesn't have a const section) and
> therefore it did not refrain from putting switch tables and string
> constants in the code section, right in the middle of the code. Of
> course this resulted in some performance penalty on all CPUs having
> split I/D caches, but it was limited to code/data conflicts within a
> fairly small cache line.
>

> However Willie is different. IMHO a trace cache cannot easily detect
> self modifying code. So W. instead seems to use the TLB for this
> purpose: A page cannot be located in both I- and D-TLB (Intel: "Do not
> put code and data on the same 4K page."). Maybe the whole trace cache
> has to be flushed if a store hits an I-TLB entry. I don't know how
> many legacy programs are affected, but those which are (plain old DOS?
> Win16? JIT compiler?) could end up being faster on the old Pentium.

Well given that the trace cache replaces the I-Cache, I don't see why it
wouldn't have the same sort of coherency protocols as the old I-Cache
implementation had. I think that flushing (well actually just
invalidating since you can't write to it) the whole trace cache would be
absolutely horrid in terms of performance.

Consider a worst-case scenario: an inner loop that executes a fair
number of instructions (say 100 or so) and is being slightly
modified (a single offset, for example) by an outer loop. Rather than
just thrashing on a single cache line's worth of bad ops, the
entire loop (and surrounding code) has been killed, and will be
continually re-killed. This example is not that unusual, since some
algorithms take parameters that are constant for the duration of the
function and are continually read, and there is severe register pressure
(not an uncommon case in the x86 world). A person may be using
self-modifying code because it's worth the penalties on the P5
and P6 architectures.

It is for reasons like this that I think the trace cache still behaves in
much the same way as a regular I-cache. The continuous unrolled micro-op
stream that makes the whole scheme worthwhile can be handled by widgets
within the trace cache mechanism. Indexing for branch targets is
necessary anyway, so it still makes sense to me that it maintains an
ordinary cache-like structure, at least from an address access point of
view.

Paul Hsieh

Feb 23, 2000, 3:00:00 AM

In article <38b2dc12$0$2...@nntp1.ba.best.com>, tro...@ussr.ru says...

> Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message

> > [...] Having to use OS-level interfaces to remap memory in the middle of a
> > tight loop which searches a huge hash table or B-tree, is not something
> > I'l like to contemplate.
>
> You dont know enough about X86 or kernel architectures to argue with me.

You wouldn't happen to go by the name "Mr. Bigglesworth", or "Scott
Nudds" in an alternate/previous life would you?

--
Paul Hsieh
http://www.pobox.com/~qed/funny.html

Terje Mathisen

Feb 23, 2000, 3:00:00 AM

John Gardner wrote:
> This Terje opened his post by immediately flaming the poor dead Commie.
>
> As far as I can tell, Trotsky never provoked Terje, other than daring
> to use a multi-line sig in Terje's newsgroups.

John, since Leon has killfile'd me, I can safely reply:

If you look back, you'll notice that I tried to tell Mr Trotsky (whose
news headers suggest that he's either from Berkeley, or attached to
best.com, definitely not 'ussr.ru') several times that having a
bank-switched memory layout, with OS-level intervention needed for any
boundary crossing, would not be a good solution for programs needing a
lot of RAM.

The .sig file comment was a small PS at the very end, which I wouldn't
have mentioned if I hadn't been a little irritated. Sorry about that.

> Terje Mathisen Terje.M...@hda.hydro.com wrote,
> >
> > Leon, please grow up a bit.

I wrote that when he kept insisting that bank-switching was just as good
as flat addressability.

> Flame those damned Commie bastards.

That's almost funny. Since I'm Norwegian and he's American, I'm living
in a country which is much, much closer to the communist ideals than
even Soviet/Russia has ever been.
:-)

Ketil Z Malde

Feb 23, 2000, 3:00:00 AM

"Carlo Razzeto" <craz...@hotmail.com> writes:

> After a billion dollars development, I want to see what they
> have....

Me too!

> I think Windows could potentially be an alternative mainframe
> operating system,

Yeah, those mainframe guys never would accept Unix, since it's not
stable enough, doesn't provide the IO or the security. Windows is
surely just what they've been looking for. It's even C2 certified!
And it has a "Start"-button.

> though it is their first attempt to break into that market.

Yes, after they've replaced all servers in the low and mid range, the
mainframe market sounds like the road ahead. And who needs Parallel
Sysplex when you can share a SCSI bus?

> However apparently Unisys believes in it enough to make a product
> based on it, so that says somthing to me...

Seriously, I think it's mostly a marketing ploy, just like the C2
certification, or the various creative web or database benchmarks. I
wouldn't hold my breath for MS-based mainframe systems.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Bill Todd

Feb 23, 2000, 3:00:00 AM

Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message

news:38B38CBC...@hda.hydro.com...

> Leon Trotsky wrote:
> > Troll, you've been elected to my blocked list.
>
> That's bad, because it robs me of the chance to use one of my all-time
> favourite quotes:
>
> "I couldn't engage in a battle of the mind with you, 'cause it wouldn't
> be fair to do so against an unarmed opponent."

My recollection is that the operative phrase was 'a battle of wits with an
unarmed opponent' (which sounds a bit more like Wilde, but I don't know the
attribution).

This discussion does make me nostalgic for PDP-11 overlays (especially
memory-mapped ones), however. I used to believe that you could do anything
worth doing in 16 bits, as long as you had enough physical memory to flip
around, so the idea that 4 GB of virtual memory may not be enough still
strikes me as a bit ludicrous. Of course, people are in much more of a
hurry nowadays.

- bill

>
> (Or at least approximately that. Anyone have the correct version, and
> original author? Oscar Wilde?)
>

Bill Todd

Feb 23, 2000, 3:00:00 AM

Ketil Z Malde <ke...@ii.uib.no> wrote in message
news:KETIL-vk1k...@eris.bgo.nera.no...

...

> Seriously, I think it's mostly a marketing ploy, just like the C2
> certification, or the various creative web or database benchmarks. I
> wouldn't hold my breath for MS-based mainframe systems.

Perhaps we should start by holding our collective breath for an MS-based
multi-user system, since that tends to be a stepping-stone on the way to
mainframehood. I'd say we'd first need to wait for an MS-based system,
period, but some might consider that too harsh a judgement on their current
offerings.

- bill

Larry Elmore

Feb 23, 2000, 3:00:00 AM

"Brian Jonathan Lee" <b...@eecg.toronto.edu> wrote in message
news:2000Feb21.2...@jarvis.cs.toronto.edu...

> In article <38AC4C5B...@bellatlantic.net>,
> Jeffrey S. Dutky <du...@bellatlantic.net> wrote:
> >how about:
> >
> >byte = 1 byte (8-bits)
> >half-word = 2 bytes (16-bits)
> >word = 4 bytes (32-bits)
> >long-word = 8 bytes (64-bits)
> >deca-word = 10 bytes (80-bits)
> >quad-word = 16 bytes (128-bits)
> >octa-word = 32 bytes (256-bits)
>
> this one makes the most sense! At work we actually say "d-word" (double
> word) for 8bytes though. get's me so confused when someone says "hey look!
> it's a 4 quad word transfer!"

However, shouldn't deca-word actually be deca-byte? (deca-_word_ would be
160 bits, no?)

Larry

Andreas Kaiser

Feb 23, 2000, 3:00:00 AM

On Tue, 22 Feb 2000 17:57:49 -0800, "Leon Trotsky" <tro...@ussr.ru>
wrote:

>There should be no functional/performance issues as long as the
>string constant is only read.

But then each and every store cycle has to be snooped against the
I-cache (could be 2 per cycle, for P5x and K7) or you need a different
consistency protocol between I- and D-cache. Indeed it is possible;
however, it has not been done yet.

All x86 chips with split I/D caches that I know of define that a cache
line is never included in both I-cache and D-cache. So just cache miss
cycles have to be snooped, with flush/castoff and reload following.
The K5 is somewhat special as it does not flush an I-cache line on a
data read miss hitting the I-cache; the read is done uncacheable
instead (a good idea for switch tables, but a bad one for strings).

>What I-TLB entry if the X86 has paging disabled?

Same behaviour. Just the way the entries are replaced changes a bit,
no table walk but a fabricated entry derived from the linear address
is used.

>What I-TLB entry if MOV CR3 was previouly executed?

In my model, a CR3 reload implies a flush of the trace cache, at least
of those contents associated with non-global TLB entries. But the
trace cache won't likely map a lot of different code bytes anyway,
because it needs much more storage for an x86 instruction than a classic
cache, and additionally the same x86 instruction may belong to several
different trace sequences which all belong to the same code but start
at a different x86 instruction.

>Because X86 caches have always been physically addressed,

No. The K5 used linear cache tags.

>I would guess that the trace cache entries are tagged with translated
>physical addresses that point within previously executed code segments.

A trace cache isn't just a different name for a classic I-cache
because the contents of a cache block do not belong to physically
adjacent instruction bytes. Let's assume that a trace cache block
contains 2 taken branches. The microop following the branch slot is
the first microop at the branch target. Since each microop may belong
to an x86 instruction which spans a cache line boundary, the trace
cache block may belong to at least 6 different L2 cache lines, with 5
of them being completely independent of the set index. That implies a
fully associative trace cache tag implementation having 6 tags per
block; even if 5 of those tags are incomplete, it still is beyond all
reason.
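
The figure of 6 follows from simple counting, sketched here under the stated assumption that each contiguous run of code straddles exactly one cache line boundary. The function name and the `lines_per_run` parameter are illustrative only.

```python
def lines_touched(taken_branches, lines_per_run=2):
    """Cache lines a trace block can draw from.

    A block with N taken branches holds N + 1 contiguous runs of x86
    code. Each run is assumed to straddle one line boundary, so it
    touches `lines_per_run` lines.
    """
    runs = taken_branches + 1
    return runs * lines_per_run


total = lines_touched(2)
print(total, "lines,", total - 1, "of them unrelated to the set index")
```

Only the line holding the first instruction of the block determines the set index; the other lines can sit anywhere in memory, which is what forces the extra tags.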

There is a different solution, based on the I-TLB, which is backed by
Intel's recommendation to avoid mixing code and data in the same 4KB
page. Let's assume that the L1 I-TLB has 24 entries like the K7. Then
each trace cache block needs a 24-bit mask with a bit set for each TLB
entry which is responsible for at least part of the x86 instruction
bytes belonging to the contained microops. If an I-TLB entry changes,
for whatever reason, all trace cache blocks having the corresponding
bit set are flushed. Aliasing doesn't matter, being read-only. And the
I-TLB snoops data miss cycles physically, at least data store misses.
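
The mask scheme can be modelled in a few lines. This is a sketch of the speculation in this post, not of any documented Willamette mechanism; the class and method names are invented, and the 24-entry I-TLB size is the K7-like assumption made above.

```python
ITLB_ENTRIES = 24  # assumed size, borrowed from the K7 as in the text


class TraceCacheModel:
    """Each trace block carries a bit mask of the I-TLB entries it used."""

    def __init__(self):
        self.blocks = {}  # block id -> 24-bit dependency mask

    def fill(self, block_id, tlb_entries):
        """Record a new block built from code covered by `tlb_entries`."""
        mask = 0
        for entry in tlb_entries:
            if not 0 <= entry < ITLB_ENTRIES:
                raise ValueError(f"no such I-TLB entry: {entry}")
            mask |= 1 << entry
        self.blocks[block_id] = mask

    def itlb_entry_changed(self, entry):
        """Flush every block that depends on a replaced or snooped entry."""
        bit = 1 << entry
        flushed = sorted(b for b, m in self.blocks.items() if m & bit)
        for block_id in flushed:
            del self.blocks[block_id]
        return flushed


tc = TraceCacheModel()
tc.fill("A", [0, 3])
tc.fill("B", [3, 7])
tc.fill("C", [5])
print(tc.itlb_entry_changed(3))  # blocks built from code under entry 3
print(sorted(tc.blocks))         # blocks that survive
```

A store that hits a page mapped by an I-TLB entry would call `itlb_entry_changed` for that entry, which flushes only the dependent blocks instead of the whole trace cache.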

But indeed, the D-cache may use the "shared" MESI state for lines
which hit the I-TLB on reload, so that stores still miss and may be
snooped. Which lessens the problem significantly, but still JIT
engines suffer.

Gruss, Andreas

Andrew Reilly

Feb 23, 2000, 3:00:00 AM

"Brian Jonathan Lee" <b...@eecg.toronto.edu> wrote in message
news:2000Feb21.2...@jarvis.cs.toronto.edu...
> In article <38AC4C5B...@bellatlantic.net>,
> Jeffrey S. Dutky <du...@bellatlantic.net> wrote:
> >how about:
> >
> >byte = 1 byte (8-bits)
> >half-word = 2 bytes (16-bits)
> >word = 4 bytes (32-bits)
> >long-word = 8 bytes (64-bits)
> >deca-word = 10 bytes (80-bits)
> >quad-word = 16 bytes (128-bits)
> >octa-word = 32 bytes (256-bits)
>
> this one makes the most sense! At work we actually say "d-word" (double
> word) for 8bytes though. get's me so confused when someone says "hey look!
> it's a 4 quad word transfer!"

Just to heap ashes on the pyre:

All of this talk of bytes is missing the point. A "word" is a unit
that pre-dates bytes, and is still widely used in its original
sense. That is, effectively the same as C's "int": whatever is
convenient for the machine. Now sure, some modern, byte-addressed
machines can do lots of things conveniently, and that confuses the
issue, but just peek into the non-byte-addressed realm of DSP
processors, supercomputers and some mainframes, and it is clear.

A double (or long) word is a double word because you (well, the
processor) have to do twice (or four times, for multiplies) as much
work on it per operation as on a word. Sometimes the cost is explicit,
sometimes it shows up in microcode, throughput or latency.
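
To make the "twice, or four times for multiplies" arithmetic concrete, here is a small Python sketch. It assumes a 32-bit word purely for illustration; the names (WORD_BITS, add_double, mul_double) are made up, and each function returns the result together with the number of word-sized operations it used.

```python
WORD_BITS = 32                  # assumed machine word size for this sketch
MASK = (1 << WORD_BITS) - 1


def add_double(a, b):
    """Add two double words (below 2**64) using two word-sized adds."""
    lo = (a & MASK) + (b & MASK)                         # word add 1
    carry = lo >> WORD_BITS
    hi = (a >> WORD_BITS) + (b >> WORD_BITS) + carry     # word add 2, with carry in
    result = ((hi & MASK) << WORD_BITS) | (lo & MASK)    # wraps modulo 2**64
    return result, 2


def mul_double(a, b):
    """Multiply two double words (below 2**64) using four word multiplies."""
    a_lo, a_hi = a & MASK, a >> WORD_BITS
    b_lo, b_hi = b & MASK, b >> WORD_BITS
    # Each partial product is one word x word -> double word multiply,
    # paired with the number of words it is shifted left by.
    partials = [
        (a_lo * b_lo, 0),
        (a_lo * b_hi, 1),
        (a_hi * b_lo, 1),
        (a_hi * b_hi, 2),
    ]
    product = sum(p << (WORD_BITS * shift) for p, shift in partials)
    return product, len(partials)
```

The add needs one operation per word plus carry propagation, while the multiply needs one partial product for every pair of words, which is where the factor of four comes from.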

A half word is a half word because you still operate on the word
(because that's convenient) but you don't care what the top half
does.

This ignores extended-precision floating point registers, because
they feel a bit like words, but sometimes take two operations to
load or store.

--
Andrew

Bill Todd

Feb 23, 2000, 3:00:00 AM

Bill Todd <bill...@foo.mv.com> wrote in message
news:8909in$bms$1...@pyrite.mv.net...

>
> Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message
> news:38B38CBC...@hda.hydro.com...
> > Leon Trotsky wrote:
> > > Troll, you've been elected to my blocked list.
> >
> > That's bad, because it robs me of the chance to use one of my all-time
> > favourite quotes:
> >
> > "I couldn't engage in a battle of the mind with you, 'cause it wouldn't
> > be fair to do so against an unarmed opponent."
>
> My recollection is that the operative phrase was 'a battle of wits with an
> unarmed opponent' (which sounds a bit more like Wilde, but I don't know the
> attribution).

Or it could have been Marx (Groucho), which somehow seems appropriate.

- bill

Sander Vesik

Feb 23, 2000, 3:00:00 AM

In comp.arch Leon Trotsky <tro...@ussr.ru> wrote:
> Sander Vesik <san...@haldjas.folklore.ee> wrote in message
>> In comp.arch Leon Trotsky <tro...@ussr.ru> wrote:
>> > Terje Mathisen <Terje.M...@hda.hydro.com> wrote in message
>> > news:38B23C39...@hda.hydro.com...
>> >> Leon Trotsky wrote:
>> > "flipping 64KB" in MSDOS isn't the same situation as "flipping 4GB"
>> > on a UNIX fileserver.
>>
>> No. It is much worse. Way much worser.

> I have heard EMS page flipping was too frequent on MSDOG.

And frequent 4Gb flipping is much worser.

>> >> Having to use OS-level interfaces to remap memory in the middle of a
>> >> tight loop which searches a huge hash table or B-tree, is not something
>> >> I'd like to contemplate.
>>
>> > You don't know enough about X86 or kernel architectures to argue with me.
>>

>> He *DOES*.

> Terje seems to be merely a dabbler in X86 microprocessors.

Check it out. Dejanews is your friend.

>> > WRT a big X86 fileserver, the filesystem driver would run in kernel mode (CPL0),
>> > so it wouldn't need to call an "OS-level interface": it's part of the OS kernel.
>>
>> It still needs to call OS level interface, even if it itself is part
>> of the OS kernel.

> Ever heard of inline asm code?

Yes. It is still big and ugly, and if it is indeed inline asm, it will be
in a hundred places, which is very bad.

>> It needs to call a big icky thing and that sounds

> "big icky thing": you don't know what you are talking about.

Ever tried actually writing code to do what you claim is easy?

>> really wrong inside a tight loop. And wastes a lot of time.

> No, what you think of as a tight loop would be a loop that
> scans the 4GB virtual space. The code to change virtual spaces
> would be somewhere in an outer outer loop, hardly ever executed,
> by the inherent nature of a disk cache.

Either:
a. you have a really strange view of what a modern advanced
diskcache (actually, filecache - nobody caches disks instead
of file data and metadata) looks like on the inside, or of what
it does and how. Indeed, a view having nothing in common with
reality and thus utterly irrelevant.
b. we would like to see pseudocode for the loops.
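
For reference, the disagreement here comes down to access locality, and a toy model makes that easy to see. The Python sketch below is not anyone's actual cache code: the function name count_remaps, the 4 GiB window and the example address lists are all assumptions made up for illustration. It simply counts how often the mapped window would have to change to service a sequence of lookups in a cache larger than one address space.

```python
GIB = 2**30
WINDOW = 4 * GIB  # assumed size of the one window that is mapped at a time


def count_remaps(addresses, window_size=WINDOW):
    """Count window switches needed to touch the addresses in the given order."""
    remaps = 0
    current = None  # nothing is mapped before the first access
    for addr in addresses:
        window = addr // window_size
        if window != current:
            remaps += 1
            current = window
    return remaps


# Hash-table-like probes scattered over a 16 GiB cache (hypothetical pattern).
scattered = [(i * 7 * GIB) % (16 * GIB) for i in range(8)]

# The same probes reordered so that each window is visited only once.
batched = sorted(scattered)
```

In this model count_remaps(scattered) remaps on every single probe, while count_remaps(batched) remaps once per window. Whether the remap ends up in the inner or the outer loop therefore depends on whether the lookups can be ordered by window, which a hash table or B-tree search generally does not allow.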

Not to mention that using memory just as a cache for the filesystem is a
rare (though not nonexistent) case. It is much better used as a cache for
the database data and indexes, kept by the database program.

>> Well, I might see how one could be led to believe some things you have
>> said, but not that Terje is trolling...

> Who are you to speak for Terje's motives? Are you Terje?

I have seen his past performance, if you will. And, no, I am unfortunately
not Terje.

> The flame bait Terje threw into this technical discussion was obvious,
> and disappointing.

I saw no flamebait.

> //////\\
> ///_ _\\\
> _| _\ /_ |_
> |.|-(.)-(.)-.| Leon Trotsky
> \| J |/
> \ =###= /
> \ .--. /
> ",###,/
> "#"

--
Sander

There is no love, no good, no happiness and no future -
these are all just illusions.

Kjetil Torgrim Homme

Feb 23, 2000, 3:00:00 AM

[Jeffrey S. Dutky]

> I much prefer my second suggestion, which names data sizes
> for the number of bytes in the element:
>
> byte = 1 byte = 8-bits
> dual-byte = 2 bytes = 16-bits
> quad-byte = 4 bytes = 32-bits
> octa-byte = 8 bytes = 64-bits
> hexa-byte = 16 bytes = 128-bits

Oops:

$ webster hexa-
hexa- or hex- cf prefix [Gk, fr. hex six -- more at SIX]
1: SIX <hexamerous>
2: containing six atoms, groups, or equivalents <hexane>

It would be better if you used the power of two in the naming.

dyobyte       2 bytes
tribyte       4 bytes
tetrabyte     8 bytes
pentabyte    16 bytes
hexabyte     32 bytes
heptabyte    64 bytes
octabyte    128 bytes
enneabyte   256 bytes
decabyte    512 bytes

I doubt hendekabyte for 1024 bytes will ever be a big hit.

Kjetil T.

Greg Alexander

Feb 23, 2000, 3:00:00 AM

In article <88vbou$2063$1...@walter.acs.nmu.edu>, Carlo Razzeto wrote:
>You should, I made a mistake.... per day... Sorry about that... But still...
>Point is MS is making an initially impressive segue into the MainFrame
>market.... Now I know it may seem sacrilegious to think that a company like
>microsoft could break into such a space, but you should keep your minds
>open... Not only UNIX can run a mainframe...

Only NT can never run a mainframe. *smirk*

Greg Alexander

Feb 23, 2000, 3:00:00 AM

In article <88vjjr$29ef$1...@walter.acs.nmu.edu>, Carlo Razzeto wrote:
>After a billion dollars development, I want to see what they have.... I
>think Windows could potentially be an alternative MainFrame operating system,
>though it is their first attempt to break into that market. I'm going to
>wait and see what the reviews are when data center comes out... However
>apparently Unisys believes in it enough to make a product based on it, so
>that says something to me...

Unisys, without a doubt, believes in money. This is commerce, not
engineering or religion.