I have a question for you:
Can the lack of bit field instructions in the x86 instruction set be
explained by patents held by other cpu designers like motorola ?
Or is there another reason why x86 instruction set is missing bit field
instructions ?
Bye,
Skybuck.
> Can the lack of bit field instructions in the x86 instruction set be
> explained by patents held by other cpu designers like motorola ?
> Or is there another reason why x86 instruction set is missing bit field
> instructions ?
The 8008 didn't have them.
The 8080 was designed as an extended 8008.
The 8086 was the 8080 extended to 16 bits.
Also, you can do bit operations with AND and OR instructions.
-- glen
What kind of bit field instruction(s) is lacking?
See motorola processor or nec processor.
InsertBits, ExtractBits, stuff like that ;)
Bye,
Skybuck.
Yes, but only on 8, 16, or 32 bits all at once, at fixed positions.
To manipulate only a few of those bits requires using masks, shifting,
and-ing, not-ing, or-ing, xor-ing etc.
Which can become quite a lot of instructions to do something "relatively
simply" like "insert" a few bits into some memory bit location.
(I wouldn't call it insert though... just "move/copy bits into a memory bit
location/offset".)
A single clock/cycle instruction would be nice.... (or 1 latency in current
terms ;))
Bye,
Skybuck.
So that means that BT, BTC, BTR and BTS were not sufficient? Your bitfields
must be larger?
> Can the lack of bit field instructions in the x86 instruction set be
> explained by patents held by other cpu designers like motorola ?
Certainly not, since bit-fields are both obvious and have been implemented
in a variety of machines for the past 50 years or so.
> Or is there another reason why x86 instruction set is missing bit field
> instructions ?
They aren't worth it. On the few occasions when you need the facility, a
mask and shift takes just two more instructions. (For comparisons or
arithmetic on similar bit-fields, you can even omit the shift step.)
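As a rough illustration of the mask-and-shift approach described above, here is a minimal C sketch. The function names are made up for this example and are not from any library; bit 0 is assumed to be the least significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `width` bits set; width may be 0..32. */
static uint32_t field_mask(unsigned width)
{
    assert(width <= 32);
    return width == 32 ? UINT32_MAX : (UINT32_C(1) << width) - 1;
}

/* Extract `width` bits starting at bit `pos`: one shift, one AND. */
static uint32_t extract_field(uint32_t word, unsigned pos, unsigned width)
{
    assert(pos < 32 && width <= 32 - pos);
    return (word >> pos) & field_mask(width);
}

/* Return `word` with the field at [pos, pos+width) replaced by `value`. */
static uint32_t insert_field(uint32_t word, unsigned pos, unsigned width,
                             uint32_t value)
{
    assert(pos < 32 && width <= 32 - pos);
    uint32_t mask = field_mask(width) << pos;
    return (word & ~mask) | ((value << pos) & mask);
}

/* Comparing the same field in two words needs no shift at all. */
static int same_field(uint32_t a, uint32_t b, unsigned pos, unsigned width)
{
    assert(pos < 32 && width <= 32 - pos);
    uint32_t mask = field_mask(width) << pos;
    return (a & mask) == (b & mask);
}
```

For example, extract_field(0xABCD1234, 8, 8) yields 0x12, and insert_field(0xABCD1234, 8, 8, 0xFF) yields 0xABCDFF34. With a constant position and width the mask is a compile-time constant, so the extract really is just a shift and an AND.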
Rotate through carry may be used to insert and delete single bits. If
you need many, then you either loop or use masks and all. SHLD/SHRD
may be helpful too, btw.
I think the reason why you don't see such advanced instructions on old
or primitive processors is because they're old and primitive. You
should feel lucky to have (I)MUL and (I)DIV. :) When I began to
program in ASM on Z80 (enhanced i8080) I didn't have this luxury.
Also, the CPU ran at only about 3 MHz and instructions took several
clocks. So, shifts, adds, LUTs -- all came handy.
Alex
If it really was the 8080 extended to more bits it would have been a
better processor. You would have been able to address 32 bits with
two 16 bit registers and not have to wait for a trip through the ALU
on every memory operation.
> Also, you can do bit operations with AND and OR instructions.
The 8086 also had the "test" instruction to do an AND but not store
the results
>
> -- glen
Could you please explain why, in your opinion, instructions such as
the following 8 bit instructions as well as the respective 16 and 32
bit do not fit the bill?
OR AL,bbbbbbbbB
AND AL,bbbbbbbbB
XOR AL,bbbbbbbbB
TEST AL,bbbbbbbbB
where "b" represents a zero or one bit.
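For readers comparing the two positions, a small C sketch of what those four immediate-mask instructions do, plus the one case they do not cover in a single step: replacing a multi-bit field with a variable value. All function names here are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* OR AL,mask: force the masked bits to 1. */
static uint8_t set_bits(uint8_t al, uint8_t mask)
{
    return (uint8_t)(al | mask);
}

/* AND AL,(inverted mask): force the masked bits to 0. */
static uint8_t clear_bits(uint8_t al, uint8_t mask)
{
    return (uint8_t)(al & ~mask);
}

/* XOR AL,mask: flip the masked bits. */
static uint8_t toggle_bits(uint8_t al, uint8_t mask)
{
    return (uint8_t)(al ^ mask);
}

/* TEST AL,mask: AND without storing; only the outcome is reported. */
static int any_bits_set(uint8_t al, uint8_t mask)
{
    return (al & mask) != 0;
}

/* Replacing a field with a variable value takes a shift, an AND and an OR. */
static uint8_t replace_field(uint8_t al, unsigned pos, unsigned width,
                             uint8_t value)
{
    assert(width >= 1 && pos + width <= 8);
    uint8_t mask = (uint8_t)(((1u << width) - 1u) << pos);
    return (uint8_t)((al & ~mask) | ((value << pos) & mask));
}
```

The first four map one-to-one onto the instructions listed above; the last one is the kind of operation a dedicated insert-bits instruction would fold into a single opcode.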
> I think the reason why you don't see such advanced instructions on old
> or primitive processors is because they're old and primitive. You
> should feel lucky to have (I)MUL and (I)DIV. :) When I began to
> program in ASM on Z80 (enhanced i8080) I didn't have this luxury.
> Also, the CPU ran at only about 3 MHz and instructions took several
> clocks. So, shifts, adds, LUTs -- all came handy.
Z80B, what a large chip, you lucky, lucky man. 6502 or even
http://nibz.googlecode.com
cheers jacko
>Skybuck Flying wrote:
>
>> Can the lack of bit field instructions in the x86 instruction set be
>> explained by patents held by other cpu designers like motorola ?
>
>> Or is there another reason why x86 instruction set is missing bit field
>> instructions ?
>
>The 8008 didn't have them.
Because the 4004 didn't have them.
x86 is an entirely brain-damaged architecture. It's a good thing that
Intel has superb process technology and zero business ethics,
otherwise x86 would be a long-forgotten joke.
John
No
> Or is there another reason why x86 instruction set is missing bit field
> instructions ?
Barcelona (AMD) introduced 5 (or was it 7) bit manipulation
instructions.
Mitch
Yeah, tragic mistakes, selecting Intel and Microsoft. It could have
been Motorola and DR.
John
Wouldn't have been Motorola, but ignoring DR was an oversight.
function KeepLowBits( Value : longword; Bits : longword ) : longword; inline;
begin
  Result := Value; // 32 bits case.
  if Bits <= 31 then
  begin
    // shl instruction limited to 31.
    Result := Result and not (4294967295 shl Bits);
  end;
end;

// Assumes the 32-bit Delphi register calling convention:
// Left in EAX, Right in EDX, Shift in ECX, result in EAX.
// SHLD shifts EAX left by CL and fills the low bits from the top of EDX.
function ShiftLeft( Left : longword; Right : Longword; Shift : longword ) : longword;
asm
  shld eax, edx, cl
end;

procedure WriteLongwordBits( Value : longword; Bits : longword; DestAddress : pointer; DestBitIndex : longword );
var
  vContent : longword;
  vMask : longword;
  vShift : longword;
  vFirstContent : longword;
  vSecondContent : longword;
  vFirstMask : longword;
  vSecondMask : longword;
  vFirstAddress : longword;
  vSecondAddress : longword;
begin
  vContent := KeepLowBits( Value, Bits );
  vMask := KeepLowBits( 4294967295, Bits );
  vShift := DestBitIndex and 7; // bit offset within the first byte
  // First: content/mask shifted into place. Second: the bits shifted out of the top.
  vFirstContent := ShiftLeft( vContent, 0, vShift );
  vSecondContent := ShiftLeft( 0, vContent, vShift );
  vFirstMask := ShiftLeft( vMask, 0, vShift );
  vSecondMask := ShiftLeft( 0, vMask, vShift );
  vFirstAddress := longword(DestAddress) + (DestBitIndex shr 3); // byte address: div 8
  vSecondAddress := vFirstAddress + 4;
  // Read-modify-write both (possibly unaligned) longwords, preserving bits outside the mask.
  Plongword(vFirstAddress)^ := (Plongword(vFirstAddress)^ and not vFirstMask) or vFirstContent;
  Plongword(vSecondAddress)^ := (Plongword(vSecondAddress)^ and not vSecondMask) or vSecondContent;
end;
Bye,
Skybuck.
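For comparison, a portable C sketch of the same idea as the Pascal routine above. This is an illustration under assumptions, not a translation: the name write_bits is made up, bits are numbered LSB-first within each byte (which matches the Pascal code on little-endian x86), and it goes byte by byte so that it only touches bytes the field actually overlaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the low `nbits` bits of `value` at bit offset `bit_index` in `dst`,
   leaving every other bit in memory unchanged. */
static void write_bits(uint8_t *dst, size_t bit_index, uint32_t value,
                       unsigned nbits)
{
    assert(nbits <= 32);
    uint64_t mask = (UINT64_C(1) << nbits) - 1;      /* low nbits set */
    unsigned shift = (unsigned)(bit_index & 7);      /* offset in first byte */
    uint64_t content = ((uint64_t)value & mask) << shift;
    uint8_t *p = dst + (bit_index >> 3);             /* first byte touched */

    mask <<= shift;                                  /* at most 39 bits wide */
    while (mask != 0) {
        /* Clear the field's bits in this byte, then OR in the new ones. */
        *p = (uint8_t)((*p & ~mask) | content);
        mask >>= 8;
        content >>= 8;
        p++;
    }
}
```

The 64-bit temporaries play the role of the SHLD pair: a 32-bit value shifted by up to 7 bits needs at most 39 bits, so one wide register holds both halves.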
That doesn't stop people from filing patents on bit field instructions.
Maybe Intel/AMD are scared of lawsuits ?
>
>> Or is there another reason why x86 instruction set is missing bit field
>> instructions ?
>
> They aren't worth it. On the few occasions when you need the facility, a
> mask and shift takes just two more instructions. (For comparisons or
> arithmetic on similar bit-fields, you can even omit the shift step.)
This will work up to a certain point... mostly when it's possible to
shift-or bits into single memory cell or registers.
As soon as multiple memory cells have to be overwritten things can get quite
nasty... especially if bits need to be preserved...
So I don't agree with you.
A simple example where "extract bits" instruction could be useful is for
huffman decompression... where huffman codes can have a number of variable
bit fields stuck next to each other.
Extracting those bit fields (huffman codes) requires multiple x86
instructions, which slows down the huffman decoder.
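To make the instruction count concrete, here is a hedged C sketch of extracting a variable-width field at an arbitrary bit offset. The name read_bits and the LSB-first bit numbering are assumptions for this example; real Huffman formats differ in bit order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return `nbits` bits starting at bit offset `bit_index` in `src`. */
static uint32_t read_bits(const uint8_t *src, size_t bit_index, unsigned nbits)
{
    assert(nbits <= 32);
    if (nbits == 0)
        return 0;                                    /* touch no memory */

    const uint8_t *p = src + (bit_index >> 3);       /* first byte needed */
    unsigned shift = (unsigned)(bit_index & 7);      /* offset in that byte */
    unsigned needed = shift + nbits;                 /* at most 39 bits */
    uint64_t window = 0;

    /* Gather just enough bytes to cover the field. */
    for (unsigned got = 0; got < needed; got += 8)
        window |= (uint64_t)*p++ << got;

    return (uint32_t)((window >> shift) & ((UINT64_C(1) << nbits) - 1));
}
```

The address split, the gather, the shift and the mask are the "multiple instructions" in question; a decoder that keeps a bit buffer in a register pays the gather only occasionally.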
A single instruction to do that would be preferred and would
probably/possibly give higher decoding speed and this is just one but an
important example ! ;)
Bye,
Skybuck =D
And the original (A stepping) '386 had Insert Bits and Extract Bits
instructions. Apparently Intel needed the microcode space, and pulled
them in the B steppings.
I programmed the tiny i8051 as well. Even though it does have MUL, DIV
and even bit addressing of the RAM's bytes (btw, just 128 bytes of
internal RAM for stack and variables -- how cool is that?:), it's far
inferior to the Z80.
Alex
Single bit "fields" are totally inadequate, and the x86 instructions
rather more so.
Mitch
Totally inadequate *FOR WHAT*? Why are the instructions you propose
better than shift and mask?
-hpa
Simple answer:
One is CISC, and the other is RISC.
> Can the lack of bit field instructions in the x86 instruction set be
> explained by patents held by other cpu designers like motorola ?
The IBM 7030 computer, STRETCH, made out of discrete transistors, had
bit field instructions.
But most computers don't. The IBM 360 didn't. The x86 architecture
started out from the 8086, a 16-bit architecture built to look as much
like an 8-bit 8080 as possible... and so when it grew, it added the
most important and popular operations, like floating-point arithmetic.
Bit-field operations are regarded as very special-purpose.
John Savard
> Yeah, tragic mistakes, selecting Intel and Microsoft. It could have
> been Motorola and DR.
It could have been Motorola. Or it could have been Intel and Digital
Research. Oh, there _was_ a CP/M 68K, but that would have been very
unlikely...
Anyways, there *was* the Macintosh. Apple, not IBM, is at fault for
playing its cards so badly as to hand Microsoft the Windows monopoly.
John Savard
Yes, they did, and it was not necessarily a big mistake.
Apple DID make a HUGE mistake, however, by price point alone.
That mistake will be their curtain call in this economy. I would not
be buying Apple stock any time soon.
Intel had their own OS called ISIS back then. It was used on their
8080 based development systems. In some ways it was more advanced
than DOS or CP/M of the same era. It had device independent I/O so
that the printer etc was handled just like a file. There was no need
for special entry points for printing.
ISIS also did a fairly cute thing on its command line parsing. It
only parsed the name of the program to be run and ran it. It left
the :CI: (stdin) pointing on the line just after the name. This meant
that if you redirected the input to a file, it was easy to implement
IF functions etc for script files.
The passing of parameters to script files was a little clunky but it
worked. The whole file was processed and the parameters were inserted
making a new file that was then used as the input to the command line
code.
Obviously to make an efficient Intercal compiler, the machine should
have a native Interleave instruction.
http://www.muppetlabs.com/~breadbox/intercal-man/
>
> Mitch
> A simple example where "extract bits" instruction could be usefull is for
> huffman decompression... where huffman codes can have a number of
> variable bit fields stuck next to each other.
>
> Extracting those bit fields (huffman codes) requires multiple x86
> instructions, which slows down the huffman decoder.
>
> A single instruction to do that would be preferred and would
> probably/possibly give higher decoding speed and this is just one but an
> important example ! ;)
It's not an important example and almost certainly doesn't support your
case anyway. For one thing, Huffman decoding is probably limited by memory
bandwidth and not CPU speed. More importantly, though, any really complex
instruction would either be implemented in microcode or would do something
really awful to the pipeline, resulting in a huge performance hit for
every other workload.
You mention the program being limited by bandwidth.
That more or less proves my point.
Many computer programs nowadays use 8 bits, 16 bits, 32 bits where maybe
only 24 bits were needed or 23 bits, or 22 bits or whatever.
Nowadays these programs might be wasting bandwidth by using too much space
for the storage:
Examples:
var
DataArray : array[0..100000000000000000000000] of integer;
versus:
var
DataArray : array[0..100000000000000000000000] of 22 bits;
The amount of bandwidth savings and therefore speed gains is left as an
exercise for the reader ! ;)
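The storage side of that exercise is simple arithmetic and can be sketched in C. The helper name packed_bytes is made up, and the element count in the note below is an arbitrary example rather than the array size above. Note that this only measures bytes stored, not the extra work needed to address packed elements.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to store `count` elements of `bits_per_element` bits each,
   packed tightly with no padding between elements. */
static uint64_t packed_bytes(uint64_t count, unsigned bits_per_element)
{
    assert(bits_per_element >= 1 && bits_per_element <= 64);
    assert(count <= UINT64_MAX / 64);                /* avoid overflow */
    return (count * bits_per_element + 7) / 8;       /* round up to bytes */
}
```

For 1,000,000 elements, 32-bit storage takes 4,000,000 bytes and 22-bit storage takes 2,750,000 bytes, a saving of 31.25 percent in memory and, at best, in memory traffic.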
Bye,
Skybuck.
"Skybuck Flying" <Blood...@hotmail.com> wrote in message
news:5b875$49b57d96$d5337e4d$77...@cache5.tilbu1.nb.home.nl...
> Nowadays these programs might be wasting bandwidth by using too much space
> for the storage:
Yeah, and they might not be. Go look up some classic CISC vs. RISC history --
adding more instructions pretty much always makes you give up something else,
and while just adding a few bit-oriented instructions is not going to be that
significant, you can really get carried away to the point where your assembly
language is almost as fancy as something like C, at which point performance
almost always suffers.
Also keep in mind that when the 8086 was designed, memory was pretty much as
fast as the CPU itself... although it cost a lot of money. These days,
memory is dirt cheap... but it's hundreds of times slower than the CPU core.
---Joel
I doubt it is for patent reasons. The DEC VAX had bit field instructions also.
Alexei A. Frounze wrote:
> When I began to program in ASM on Z80 (enhanced i8080) I didn't have this luxury.
(of MUL and DIV)
Hm! And when I started (pre Z80, and Bill and the Steves were
playing with trains) there wasn't ADD or SUBTRACT either; arithmetic
was in BCD and all math was by table look-up for ADD and SUBTRACT and
subroutines using these for MUL and DIV; the screen was an
oscilloscope and the keyboard a teletype.
On the other hand... *addressing* the data might cost you more than
the memory-to-CPU bandwidth savings would pay for.
Indexing an array which is word-wide (or byte-wide or doubleword-wide)
is usually very fast. At worst you can convert from an array-index to
a memory-offset simply by shifting the value left or right by one or
two bits... an extremely fast operation. In some cases, CPUs will do
the necessary shifting for you as part of the load/store instruction.
For dealing with "packed" arrays of odd lengths, such as you are
suggesting, the index-to-address calculation is often very much more
expensive. If you want to access 21-bit data items through a "load
the right word(s) and extract the bits" methodology, then you need to
be able to convert each array index into:
Memory word index
Starting bit
Ending bit
and then load one or two words from memory and do the bitfield-extract
based on the computed bit numbers.
This turns out to be an expensive operation, as it will typically
require at least a multiply operation... which on many processors is
significantly slower than a shift. It may also require a divide or
modulus... and these can be even slower.
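The index math described above can be sketched in C roughly as follows. The names packed_get and ELEM_BITS are invented for this example, and the layout (elements packed LSB-first into 32-bit words) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ELEM_BITS = 21 };   /* element width used in the discussion above */

/* Fetch element `index` from an array of `count` packed 21-bit items.
   `words` must hold at least ceil(count * 21 / 32) 32-bit words. */
static uint32_t packed_get(const uint32_t *words, size_t count, size_t index)
{
    assert(index < count);
    uint64_t bit   = (uint64_t)index * ELEM_BITS;    /* the multiply */
    size_t   word  = (size_t)(bit / 32);             /* memory word index */
    unsigned start = (unsigned)(bit % 32);           /* starting bit */
    unsigned end   = start + ELEM_BITS - 1;          /* ending bit */

    uint64_t window = words[word];
    if (end >= 32)                                   /* spills into next word */
        window |= (uint64_t)words[word + 1] << 32;

    return (uint32_t)((window >> start) & ((UINT32_C(1) << ELEM_BITS) - 1));
}
```

Because 32 is a power of two, the divide and modulus here reduce to a shift and a mask; the multiply by 21 and the possible second load are the costs that a plain word-wide array avoids.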
If you're lucky enough to be accessing your array sequentially most
of the time, you may be able to optimize some of the sequential
operations to avoid the need to multiply, and you may gain a
performance advantage in these cases.
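A sketch of that sequential case in C, under the same assumed layout (21-bit elements packed LSB-first into 32-bit words; the names are made up). A running bit accumulator replaces the per-element multiply, and each memory word is loaded exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ELEM_BITS = 21 };

/* Sum `count` packed 21-bit elements by walking the array in order. */
static uint64_t packed_sum(const uint32_t *words, size_t count)
{
    assert(words != NULL || count == 0);
    const uint64_t elem_mask = (UINT64_C(1) << ELEM_BITS) - 1;
    uint64_t acc = 0;        /* bits loaded but not yet consumed */
    unsigned have = 0;       /* how many valid bits acc holds */
    uint64_t sum = 0;

    for (size_t i = 0; i < count; i++) {
        if (have < ELEM_BITS) {                      /* refill from memory */
            acc |= (uint64_t)*words++ << have;
            have += 32;
        }
        sum += acc & elem_mask;                      /* next element */
        acc >>= ELEM_BITS;
        have -= ELEM_BITS;
    }
    return sum;
}
```

Per element this costs a compare, a mask, a shift and a subtract, with a load roughly two iterations out of three.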
However, if you're mostly making random accesses to the array, then you're
unlikely to avoid many memory accesses by packing the array, and the
cost of doing the multiply-and-mask/modulus operation for the index
math may very well cost you more cycles than the few you'll save on
hitting the same memory location twice.
Oh... reaching back to your original question as to why the X86
instruction set doesn't have a powerful set of arbitrary-bitfield
manipulation instructions... I'd put it down to the following factors:
- Intel has always favored backwards-compatibility in their
instruction set. They don't like making incompatible changes, and
haven't had terribly good luck in the market when they did try
(read up on the IA-64/Itanium architecture and its instruction set,
which did *not* achieve great market success).
- The X86 instruction set architecture has its roots back in the days
of much simpler processors, with far lower silicon density than is
available today.
- In the terms of those days and silicon processes, the sort of
arbitrary-bitfield manipulation instructions you are asking about
were quite expensive. They require doing both a programmable
bit-shift, and a programmable masking operation, each of which
requires quite a lot of gates (again, by the standards of that
era) and thus a lot of square area on the CPU chip.
- These instructions would probably require duplicating a significant
amount of silicon logic which would already exist for other
purposes (e.g. for the shift instructions). Re-use might not
be possible.
I believe that at the time these CPUs were invented, the designers
felt that they had better uses for their available silicon-space
and power budget. The feature you're asking for was simply not used
enough to win the battle for silicon space.
The silicon-area/complexity equation has changed a lot since then, of
course, and is probably not a factor at all these days. Concern about
breaking backwards compatibility, or a desire to use the limited
amount of previously-unused X86 instruction set space for other
purposes of greater benefit (e.g. adding MMX and the like) is probably
more of a factor.
--
Dave Platt <dpl...@radagast.org> AE6EO
Friends of Jade Warrior home page: http://www.radagast.org/jade-warrior
I do _not_ wish to receive unsolicited commercial email, and I will
boycott any company which has the gall to send me such ads!
Hmm... something like the old IBM CADET -- "Can't Add, Doesn't Even Try?"
Huffman is a perfect example: we used to have a poster here who would
come back every few months with some imagined need for a multi-bit
extraction opcode, and each and every time I was able to show portable
code (C) that would do the same thing, and be faster:
For Huffman it is actually quite simple:
If you have many short tokens (which you hope for, otherwise Huffman is
a bad choice, right?), then a table-driven approach can average more
than one token per iteration, by looking up the next N bits and extracting
one or more tokens.
For those very long, but also very rare tokens, you use multiple table
levels.
I happen to know that this particular method works very well, since I
used it when writing/optimizing the world's fastest ogg vorbis decoder. :-)
BTW, in Vorbis I never get more than one token, but it is still plenty
fast enough, with most branch instructions predicting very well,
particularly in 64-bit mode. This is due to having 64 bits in the decode
buffer, so I can run many iterations of the decoder between each refill
of that register.
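A minimal C sketch of the table-driven approach with a 64-bit decode buffer, under loudly stated assumptions: the four-symbol code (A=0, B=10, C=110, D=111), the MSB-first bit order, and every name here are invented for illustration and have nothing to do with the actual Vorbis decoder. It decodes one token per lookup and uses a single table level; the multi-token and multi-level refinements described above are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TABLE_BITS = 3 };   /* longest code in the toy alphabet */

typedef struct { uint8_t symbol; uint8_t length; } Entry;

/* Indexed by the next 3 bits; short codes occupy several slots. */
static const Entry decode_table[1 << TABLE_BITS] = {
    {'A', 1}, {'A', 1}, {'A', 1}, {'A', 1},   /* 0xx */
    {'B', 2}, {'B', 2},                       /* 10x */
    {'C', 3},                                 /* 110 */
    {'D', 3}                                  /* 111 */
};

/* Decode up to `nsymbols` symbols; returns how many were produced. */
static size_t huff_decode(const uint8_t *src, size_t src_len,
                          size_t nsymbols, char *out)
{
    assert(out != NULL);
    uint64_t buf = 0;      /* unread bits, left-aligned */
    unsigned count = 0;    /* number of valid bits in buf */
    size_t pos = 0, produced = 0;

    while (produced < nsymbols) {
        if (count < TABLE_BITS) {             /* refill only when nearly dry */
            while (count <= 56 && pos < src_len) {
                buf |= (uint64_t)src[pos++] << (56 - count);
                count += 8;
            }
        }
        Entry e = decode_table[buf >> (64 - TABLE_BITS)];
        if (e.length > count)                 /* input exhausted */
            break;
        out[produced++] = (char)e.symbol;
        buf <<= e.length;                     /* consume the code */
        count -= e.length;
    }
    return produced;
}
```

The bytes 0x4C 0xE0 hold the bit string 0 10 0 110 0 111 followed by padding, which decodes to ABACAD when six symbols are requested. The caller has to know the symbol count, because zero padding is indistinguishable from further A codes in this toy alphabet.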
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
IBM 1620 Model 1, huh?
IIRC, the Model 2 had an adder.
--
ArarghMail903 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html
To reply by email, remove the extra stuff from the reply address.
Even without cache, DDR2 peak memory bandwidths of 6400 MB/s are
readily achievable for sequential reads. Assuming a 3 GHz single core
might eat 6000 MB/s in instructions and 6000 MB/s in data, the speed
ratio is hardly in the hundreds. With typical locality largely
absorbed (maybe 75%) in cache, the required memory bandwidth is not
more than 6000 MB/s peak and 3000 MB/s sustained. Nicely within the
available speed of current memory.
I was reading along expecting the Intel 432 to be mentioned... and
you came up with the Itanium. Oh well, possibly better known.
Cheers,
Steve N.
Intel has tried to burn all records that there ever was a 432. It was
proposed just at the same time as the 286 came out. Its performance
sucked. N 286s would always outperform N 432s and do so on far less
power.
It is unfortunate that we are stuck going down the x86 path but that
is the one where everything seems to have gone.
MooseFET wrote:
> Intel has tried to burn all records that there ever was a 432. It was
> proposed just at the same time as the 286 came out. Its performance
> sucked. N 286s would always outperform N 432s and do so on far less
> power.
IIRC the i432 was actually a chipset comprising 3 different ICs,
whereas 286 was a single chip.
> It is unfortunate that we are stuck going down the x86 path but that
> is the one where everything seems to have gone.
It is weird that the virtual memory and the memory protection was
available since 286 yet neither OSes nor applications use it because it
comes against the programming paradigm of the flat memory.
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Yeah, I was doing a bit too much hand waving there (and "100 times" is
certainly too large). What I meant was that *latency* to memory can be
upwards of 100 CPU cycles these days, when all the fancy caching and
predictive fetching algorithms don't happen to have already moved the RAM data
closer to the CPU core by the time that it's needed. E.g., you can still
write a program that makes a 3GHz Pentium perform no faster than a 300MHz
Pentium by executing "worst case" memory access patterns... although I agree
that in real world applications that doesn't happen.
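One common way to build such a "worst case" access pattern is pointer chasing through a randomly shuffled cycle, sketched below in C. Every name is made up for this example, the xorshift generator is just a convenient stand-in, and whether it actually exposes memory latency depends on the array being much larger than the caches, which this sketch does not measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator; the state must be nonzero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fill next[] so that following next[i] from any start visits all n
   slots exactly once before returning (Sattolo's shuffle). */
static void build_cycle(size_t *next, size_t n, uint32_t seed)
{
    assert(n >= 1 && seed != 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = next_random(&seed) % i;    /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Each load depends on the previous one, so the loads cannot overlap. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];
    return p;
}
```

Because every address comes from the previous load, neither out-of-order execution nor a stride prefetcher can hide the latency once the array no longer fits in cache.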
---Joel
> It is weird that the virtual memory and the memory protection was
> available since 286 yet neither OSes no applications use it because it
> comes against the programming paradigm of the flat memory.
The segmentation was used quite extensively if you actually had a 286,
e.g. XMS (DOS) or 286-mode (Windows 3.1). But the 8086 didn't have it, and
the 386 offered a 32-bit flat address space, with memory protection
available on pages, so everything used that rather than segmentation.
But the 286 had such a short reign. Initially, most applications retained
8086 compatibility, so any use of 286-specific features tended to be
isolated. By the time that the 8086 really started dying off as a viable
platform, the 386 was out. So the 286 basically got leap-frogged.
Weren't there a year or two of 286-based PC's? The XT?
...Jim Thompson
--
| James E.Thompson, P.E. | mens |
| Analog Innovations, Inc. | et |
| Analog/Mixed-Signal ASIC's and Discrete Systems | manus |
| Phoenix, Arizona 85048 Skype: Contacts Only | |
| Voice:(480)460-2350 Fax: Available upon request | Brass Rat |
| E-mail Icon at http://www.analog-innovations.com | 1962 |
I love to cook with wine Sometimes I even put it in the food
snip
> It is unfortunate that we are stuck going down the x86 path but that
> is the one where everything seems to have gone.
ISTM that most of the arguments against the X86 come down to
"aesthetics". While I agree that it is ugly, it has shown itself to be
capable of very high performance and extensibility (new instructions,
wider addressability, new addressing modes, etc). You can do low power,
but perhaps not as low as a different architecture 64 bit chip could be,
but I suspect the difference would be modest.
So other than aesthetics, what is wrong with X86?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
snip
> Weren't there a year or two of 286-based PC's?
I don't know for how long, but there certainly were some, both by IBM
and Compaq, and probably others.
> The XT?
No, the XT was the 8088 with a hard disk. The 286 based models were the AT.
Stephen Fuld wrote:
> Jim Thompson wrote:
>
> snip
>
>
>> Weren't there a year or two of 286-based PC's?
>
>
> I don't know for how long, but there certainly were some, both BY IBM
> and Compaq, and probably others.
>
> > The XT?
>
> No, the XT was the 8088 with a hard disk. The 286 based models were the
> AT.
There used to be the XT 286 models as well. The same XT architecture yet
using the 286 CPU.
Stephen Fuld wrote:
> MooseFET wrote:
>
>> It is unfortunate that we are stuck going down the x86 path but that
>> is the one where everything seems to have gone.
>
>
> ISTM that most of the arguments against the X86 come down to
> "aesthetics". While I agree that it is ugly, it has shown itself to be
> capable of very high performance and extensibility (new instructions,
> wider addressability, new addressing modes, etc). You can do low power,
> but perhaps not as low as a different architecture 64 bit chip could be,
> but I suspect the difference would be modest.
The x86 is not too bad. Actually it has its roots from i8080. I can only
imagine if, say, PIC16 architecture was selected as a base for PC. How
would you like the dual core 4GHz PIC ?
> So other than aesthetics, what is wrong with X86?
1. The x86 instruction set with the variable command length is very
inconvenient for pipelining and speculative execution. That results in
the overcomplication of the modern CPU hardware.
2. The non-orthogonal set of registers makes the code optimization a
non-tractable problem.
Not at all: 16-bit x86 code has so many registers statically allocated
that a compiler pretty much has all those decisions fixed up front.
I.e. with SI/DI for source/destination, CX as shift count/loop count,
DX:AX as the accumulator, BX for any needed table lookup or indexing,
and BP for stack frames, there's nothing left to do. :-)
>Vladimir Vassilevsky wrote:
>> 2. The non-orthogonal set of registers makes the code optimization a
>> non-tractable problem.
>
>Not at all: 16-bit x86 code has so many registers statically allocated
>that a compiler pretty much has all those decisions fixed up front.
>
>I.e. with SI/DI for source/destination, CX as shift count/loop count,
>DX:AX as the accumulator, BX for any needed table lookup or indexing,
>and BP for stack frames, there's nothing left to do. :-)
Unless you want to do excessive stuff like loops within loops, or
accessing more than one data structure at a time.
Nobody needs that, right?
John
Then you just go to PUSH / POP semantics for nesting.
Some might view x86 as short on registers (I don't anymore).
It makes up for it with _blazing_ fast L1 cache and convenient
[EBP+..] addressing for parms & locals.
-- Robert
Right. Optimization is easy. ;-)
> Some might view x86 as short on registers (I don't anymore).
> It makes up for it with _blazing_ fast L1 cache and convenient
> [EBP+..] addressing for parms & locals.
RISC has both (more registers and blazingly fast cache).
It's barbaric. Intel just applied massive amounts of cmos process
technology and illegal business tactics to a stupid architecture.
Interestingly, all their attempts at better architectures have been
expensive failures. i432, i960, Itanic, ARM.
John
Obviously not.
More seriously, a few years ago the fastest possible code for doing
vector math on an AMD cpu would process everything three times:
1) Load a cache-sized block of data by reading one byte from each cache
line, using integer loads.
2) Process that block with fp operations, writing the results to a fixed
(cache-resident) buffer.
3) Copy the temp results to the target array using MMX Non-Temporal
moves, avoiding any cache pollution caused by the otherwise needed
read-for-ownership memory accesses.
One of the (but not the most important) reasons this was so fast was
that each individual loop would actually fit nicely within the 7-8
available registers!
There is some pretty good microarchitecture in there too. Don't
forget, Intel isn't the only one to apply some pretty impressive
lipstick to the x86 pig. Many better than that of Intel.
>Interestingly, all their attempts at better architectures have been
>expensive failures. i432, i960, Itanic, ARM.
Because they can.
Itanic was part of your "illegal business tactics" (and "stupid
architecture" ;-).
>> 2. The non-orthogonal set of registers makes the code optimization a
>> non-tractable problem.
>
> Not at all: 16-bit x86 code has so many registers statically allocated
> that a compiler pretty much has all those decisions fixed up front.
>
> I.e. with SI/DI for source/destination, CX as shift count/loop count,
> DX:AX as the accumulator, BX for any needed table lookup or indexing,
> and BP for stack frames, there's nothing left to do. :-)
That makes naive code generation easy, but it also makes optimisation
really hard. Optimisation means using all of the registers, not just the
"right" ones. Highly optimised code often uses -fomit-frame-pointer, to
allow EBP to be used as a general-purpose register. Needless to say, that
makes accessing parameters and local variables rather ugly.
>> It is unfortunate that we are stuck going down the x86 path but that
>> is the one where everything seems to have gone.
>
> ISTM that most of the arguments against the X86 come down to
> "aesthetics". While I agree that it is ugly, it has shown itself to be
> capable of very high performance and extensibility (new instructions,
> wider addressability, new addressing modes, etc). You can do low power,
> but perhaps not as low as a different architecture 64 bit chip could be,
> but I suspect the difference would be modest.
>
> So other than aesthetics, what is wrong with X86?
Aesthetics is the wrong term, as it implies something without any
impact upon functionality.
The x86's architectural ugliness means that a great deal of inefficiency
is involved in getting the current levels of performance. A RISC chip with
comparable performance would require far less silicon and far less power.
Not much. ESP is then used as a frame pointer. The instructions
get a couple of bytes longer and the offsets slightly less readable.
"Premature optimization is the root of all evil" [Knuth].
Also, optimization is not what it used to be. The cost of register
spills has gone down while the cost of mispredicted branches has
gone _way_ up. Processors have not gotten uniformly faster.
-- Robert
...hence the reason you don't see traditional x86 CPUs in cell phones, PDAs,
etc...
Joel Koltner wrote:
...yet the newer processors don't offer any significant breakthrough in
the computing performance compared to the x86s.
Robert Redelmeier wrote:
> The cost of register
> spills has gone down while the cost of mispredicted branches has
> gone _way_ up. Processors have not gotten uniformly faster.
And the mispredicted branches are so expensive because of the huge
pipeline required to process the x86 instructions.
I question the use of "far". Others here have put the overhead of
decoding the X86 instructions at a few percent of the total logic.
Besides, on a current desktop or server chip, the overwhelming part of
the silicon is taken up with cache, not CPU logic. So while I suspect that
there would be some savings in logic and power, I don't think it would
be "far". And there is some countervailing effect of the smaller
instructions meaning more instructions in a given size I cache, so
perhaps a higher hit rate. I suspect this effect is small, but it is
something.
Nonsense. All modern bleeding edge processors have long pipes.
x86 has little to do with it.
At their cores all the x86 CPUs are now RISC anyway -- they're just surrounded
by circuitry that breaks apart the x86 instructions into "micro operations."
Hence, yeah, for the ultimate in performance, RISC is not really any better
than x86 (and whatever inherent performance advantage a native RISC design
might have is probably offset by Intel's excellent manufacturing/die shrinking
abilities). They do still suffer a bit of a power penalty though, but even
there Intel is aware of their failing and will steer you towards the Atom CPU
which is quite respectable when it comes to performance per watt. (I wonder
if we will see Atom-based phones and PDAs?)
I almost hate to say it, but for all the failed projects that Intel has had,
they've done a much better job at continuing to manage and evolve their core
x86 product line than most other companies, e.g., Microsoft and Windows. Even
their marketing campaigns, while sometimes completely absurd ("the Internet
was designed to run on Intel processors" -- say what!? When the Internet
began I suspect the only Intel CPUs in use were inside of the keyboards hooked
up to the mainframes and workstations!?), have been effective.
---Joel
In the US alone, several gigawatt-level power plants are working 24/7
to overcome Intel's and Microsoft's crappy designs.
More people would use "suspend" if it worked. More people would turn
off computers if they booted up quicker.
John
Actually, "nobody" has a point. Architectural ugliness has very little
to do with the instruction set and a great deal to do with the basic
computational model. However, in this respect, many "RISC" designs
are as ugly as the x86 :-(
Take, for example, floating-point and page table management (TLBs).
A well-designed architecture ensures that functionally separate
instructions can be executed independently. But almost all of them
fail to carry that through to interrupt handling, so the first TLB
miss or floating-point exception/fixup causes the pipeline to glitch!
Or doesn't, and causes the FLIH to have the most disgusting hacks to
cover that up, and that STILL leaves a race condition that can cause
serious problems!
My understanding is that a lot of the logic is concerned with trying
to combine aggressive pipelining/parallelism, while still ensuring
that such problems don't cause chaos. Interrupts are just an extreme
case, and there are a zillion others in most architectures, often
at a much lower level.
>Besides, on a current desktop or server chip, the overwhelming part of
>the silicon is taken up with cache, not CPU logic. ...
Yes. But the x86 is bad there - not as bad as most "RISC" systems,
true. A good design could probably cut the cache requirement very
considerably - or make the current amounts more effective.
Regards,
Nick Maclaren.
In 32-bit mode nearly all the important restrictions were removed,
making the cpu a lot more orthogonal.
They do; per unit of power consumed. Another roadblock: power
consumption at high clock frequencies. Put these two together, go back
a few years and make a roadmap: multi-core architectures emerge.
"Suspend" used to work on my machine. Now it works if I turn off
the DSL modem.
Grrrrins,
James Arthur
The CISC-to-RISC decoder consumes a negligible fraction of the silicon
and power in modern x86 chips. It's the register renaming, out-of-order
execution, and on-die cache that consume the majority of the silicon and
power these days on all high (single-threaded) performance chips --
regardless of the ISA.
x86 isn't the liability that you think it is.
S
--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Isaac Jaffe
"Suspend" works fine for me. Most of the time, though, I "hibernate" my
machine rather than turning it off (in fact, it's been at least two
years since I rebooted or shut down my machine other than when Windows
Update forces me to), though some genius at Microsoft apparently decided
"hibernate" should be disabled by default, so you have to go digging
into system menus to enable it. They could also put in some improved
logic that would switch from sleeping to hibernating after a period of
inactivity, which would make both features easier to use and more
user-friendly. And, of course, we would save untold amounts of power if
the default screen saver configuration was set to sleep after an hour or
two instead of bounce a Windows logo around the screen for days on end...
Well, there were the Nokia 9000 and 9110 Communicators.
Phil
--
I tried the Vista speech recognition by running the tutorial. I was
amazed, it was awesome, recognised every word I said. Then I said the
wrong word ... and it typed the right one. It was actually just
detecting a sound and printing the expected word! -- pbhj on /.
>> The x86's architectural ugliness means that a great deal of inefficiency
>> is involved in getting the current levels of performance. A RISC chip with
>> comparable performance would require far less silicon and far less power.
>
> I question the use of "far". Others here have said the overhead of
> decoding the X86 instructions is a few percent of the total logic.
> Besides, on a current desktop or server chip, the overwhelming part of
> the silicon is taken up with cache, not CPU logic.
Cache doesn't consume anywhere near as much power as the "active" parts of
the CPU.
Nobody wrote:
On this topic, I see many statements like "as much", "far less",
"overwhelming" and so on. Those adjectives mean nothing.
Can anyone back up his point with the particular facts, figures and
quotations of the sources of the information?
>
>
>Nobody wrote:
>
>> On Wed, 11 Mar 2009 19:02:25 +0000, Stephen Fuld wrote:
>>
>>
>>>>The x86's architectural ugliness means that a great deal of inefficiency
>>>>is involved in getting the current levels of performance. A RISC chip with
>>>>comparable performance would require far less silicon and far less power.
>>>
>>>I question the use of "far". Others here have said the overhead of
>>>decoding the X86 instructions is a few percent of the total logic.
>>>Besides, on a current desktop or server chip, the overwhelming part of
>>>the silicon is taken up with cache, not CPU logic.
>>
>>
>> Cache doesn't consume anywhere near as much power as the "active" parts of
>> the CPU.
>
>On this topic, I see many statements like "as much", "far less",
>"overwhelming" and so on. Those adjectives mean nothing.
Nevertheless, he's right. Caches draw next to nothing, per unit area.
Remember it's STATIC. Nothing is switching, other than the line
currently being accessed. Leakage is far more.
>Can anyone back up his point with the particular facts, figures and
>quotations of the sources of the information?
No one is going to give specifics publicly, but he's right. Just
think about it.
My EEPC boots quickly :>
>
> John
Actually, the decoders in Opteron (pre Barcelona) (all 7 of them) are
smaller than a single 4KByte chunk of SRAM. 4 are one-byte decoders
used when the predecode information is not present, and the other 3
are the multi-byte instruction at a time superscalar decoders.
The out-of-order stuff (reservation-stations/reorder-buffer/future-
file/LS1/2) are several times larger than all the computation circuits
put together (like 5X)
The branch predictor and associated circuitry is larger than all the
computational circuitry put together.
Take the pipeline flip-flops out of all the computational circuitry
(int, mem, float), and the total area of the computational circuits is
smaller than 4KB of SRAM. Leave the pipeline flip-flops in and the
computation circuitry is still less than 8KBytes of SRAM.
> x86 isn't the liability that you think it is.
New instruction idiom recognition decoders are even converting MOV + 2-
op instructions into 3-op instructions so as to execute them in a
single cycle; compare+branch is done similarly, and a few others.
x86 (the instruction set) is not as hard to decode as is SPARC V9+VIS
(and whatever they may have done to it over the last 9 years).
x86 is not any liability whatsoever (excepting perhaps the legal
challenges that might be brought forth).
Mitch
> The out-of-order stuff (reservation-stations/reorder-buffer/future-
> file/LS1/2) are several times larger than all the computation circuits
> put together (like 5X)
[...]
> Take the pipeline flip-flops out of all the computational circuitry
> (int, mem, float), and the total area of the computational circuits is
> smaller than 4KB of SRAM. Leave the pipeline flip-flops in and the
> computation circuitry is still less than 8KBytes of SRAM.
[...]
> x86 is not any liability whatsoever (excepting perhaps the legal
> challenges that might be brought forth).
What about the semantic complexity of the x86 ISA?
I suppose all of the trickier instructions get delegated to microcode,
but isn't there a cost imposed by segmentation or the density of loads
and stores?
I know the CPUs can optimize for the "flat" segment model, and bypass
the logic for bounds checking and adding base addresses. But just
because you can bypass the complex logic doesn't mean it was free.
Perhaps I overestimate how those costs add up, but it seems like
circuitry that must be located ~1 clock cycle's wire delay from the
load/store units occupies some prime real estate. Especially
considering that you're hoping to never use it!
Similarly, I thought having an extra source of faults (segment limit
check violation) contributes to the complexity of the out-of-order
stuff. Those issues would be compounded by the load/store density of
x86/x64 code - those have to be fast paths.
Out of curiosity, are the 4/8KB SRAMs you mentioned for size comparison
the vanilla single-ported variety? Would that be including the vector
unit?
-Eric
One of the things that the IA64 got right in principle and wrong in
practice was to try to simplify that area by making it more explicit.
I still think that could be done - but not that way!
>The branch predictor and associated circuitry is larger than all the
>computational circuitry put together.
Indeed? It is, of course, a computationally intractable (in the CS
sense) task.
>Take the pipeline flip-flops out of all the computational circuitry
>(int, mem, float), and the total area of the computational circuits is
>smaller than 4KB of SRAM. Leave the pipeline flip-flops in and the
>computation circuitry is still less than 8KBytes of SRAM.
Including a full, glorious, optimised IEEE 754 unit? Boggle. If one
adds full support for denormalised numbers, exceptional results and
(heaven help us) decimal floating-point, that will clearly go up,
but not by a huge factor.
>x86 (the instruction set) is not as hard to decode as is SPARC V9+VIS
>(and whatever they may have done to it over the last 9 years).
That fails to surprise me! I have always been a supporter of RISC,
the principle, and very unimpressed with RISC, the dogma.
>x86 is not any liability whatsoever (excepting perhaps the legal
>challenges that might be brought forth).
Grrk. Now, THERE I disagree. It's extremely unclear how to extend
it to allow for scalable parallelism, except by the tried (and not
very successful) heavyweight threading approach. Of course, the
same remark applies to all of the current 'RISCs' ....
Regards,
Nick Maclaren.
> In the US alone, several gigawatt-level power plants are working 24/7
> to overcome Intel's and Microsoft's crappy designs.
>
> More people would use "suspend" if it worked. More people would turn
> off computers if they booted up quicker.
Maybe the electricity is cheaper than redesigning the kit? Maybe it's the
customer who pays for the electricity and the vendor who pays for the
redesign? Maybe if that electricity came from a nice green, sustainable
nuclear plant it wouldn't actually matter? Maybe it doesn't matter anyway?
And the good old HP 200LX I play with. 80186, with excellent
battery life. And mine has a FORTRAN compiler to boot.
Cheers,
Steve N.
Suspend *does* work at least on properly configured computers. There are
a few rogue hardware designs that have typically USB peripherals or
other drivers that do not handle suspend gracefully but that isn't the
CPU's fault. Most of the problems reside in buggy 'Doze device drivers.
I have a year old Toshiba portable supplied with Vista that is barely
usable - it regularly disables its own keyboard and its on-off switch
(whatever power-save settings are used). Works OK on XP or even
Win98SE so it is a Vista fault with too-clever-by-half hardware drivers.
Regards,
Martin Brown
OK, I should have said, "the vast majority of cell phones/PDAs."
The 200LX was a pretty neat machine -- I once worked at a place where they
were used as data collection terminals, and found them surprisingly usable.
HP took the idea that Atari had come up with in their Portfolio and really
made it usable (ok, they started with the 100, but even that was already a big
improvement).
There sure were a lot of interesting, weird machines coming out back in the
'80s... you just don't see that these days, now that everyone expects a
PDA/PC/etc. to immediately have a full-featured web browser, e-mail, etc., so
you're pretty much tied to an existing architecture and operating system and
only get to make evolutionary changes to the platform.
I'm not sure if the equivalent of a 200LX today is something like a Nokia N810
(completely open, hackable Linux machine, modest-if-not-spectacular software
provided by Nokia) or an iPod Touch (relatively closed machine, hacking
strongly discouraged by Apple, but lots of pretty impressive software provided
by them too).
---Joel
>>>>The x86's architectural ugliness means that a great deal of inefficiency
>>>>is involved in getting the current levels of performance. A RISC chip with
>>>>comparable performance would require far less silicon and far less power.
>>>
>>>I question the use of "far". Others here have said the overhead of
>>>decoding the X86 instructions is a few percent of the total logic.
>>>Besides, on a current desktop or server chip, the overwhelming part of
>>>the silicon is taken up with cache, not CPU logic.
>>
>> Cache doesn't consume anywhere near as much power as the "active" parts of
>> the CPU.
>
> On this topic, I see many statements like "as much", "far less",
> "overwhelming" and so on. Those adjectives mean nothing.
>
> Can anyone back up his point with the particular facts, figures and
> quotations of the sources of the information?
I doubt it; if this information is available at all, it would almost
certainly require an NDA.
But it's no secret that the power consumption for digital logic is
dominated by energy-per-transition rather than quiescent current.
Careful. Don't spread it around. The AGW crowd will want to tax each
transition :-(
...Jim Thompson
--
| James E.Thompson, P.E. | mens |
| Analog Innovations, Inc. | et |
| Analog/Mixed-Signal ASIC's and Discrete Systems | manus |
| Phoenix, Arizona 85048 Skype: Contacts Only | |
| Voice:(480)460-2350 Fax: Available upon request | Brass Rat |
| E-mail Icon at http://www.analog-innovations.com | 1962 |
Lord protect me from queers, fairies and Democrats
Please! The last one will be taxing enough.
http://townhall.com/Columnists/DanKennedy/2009/03/11/but_what_if_the_rich_refuse_to_be_eaten)
Bumper sticker: "Don't buy until he's gone"
Nobody wrote:
> On Wed, 11 Mar 2009 18:41:49 -0500, Vladimir Vassilevsky wrote:
>
>
>>>>>The x86's architectural ugliness means that a great deal of inefficiency
>>>>>is involved in getting the current levels of performance. A RISC chip with
>>>>>comparable performance would require far less silicon and far less power.
>>>>
>>>>I question the use of "far". Others here have said the overhead of
>>>>decoding the X86 instructions is a few percent of the total logic.
>>>>Besides, on a current desktop or server chip, the overwhelming part of
>>>>the silicon is taken up with cache, not CPU logic.
>>>
>>>Cache doesn't consume anywhere near as much power as the "active" parts of
>>>the CPU.
>>
>>On this topic, I see many statements like "as much", "far less",
>>"overwhelming" and so on. Those adjectives mean nothing.
>>
>>Can anyone back up his point with the particular facts, figures and
>>quotations of the sources of the information?
>
>
> I doubt it; if this information is available at all, it would almost
> certainly require an NDA.
:)))))))
Nobody knows anything but everybody has the invaluable opinion. That's
the essence of leftism - weenism.
> But it's no secret that the power consumption for digital logic is
> dominated by energy-per-transition rather than quiescent current.
That's true for 3.3V, it depends for 1.5V, and it is pretty much not
true for below one volt high speed logic. Consider the fanout and the
stray capacitance as well.
I *know*. You won't listen. *You* are the essence of weenieism.
> > But it's no secret that the power consumption for digital logic is
> > dominated by energy-per-transition rather than quiescent current.
>
> That's true for 3.3V, it depends for 1.5V, and it is pretty much not
> true for below one volt high speed logic. Consider the fanout and the
> stray capacitance as well.
Absolute horseshit.
>In article <zDdul.22113$Ws1....@nlpi064.nbdc.sbc.com>,
>antispa...@hotmail.com says...>
[snip]
>>
>> Nobody knows anything but everybody has the invaluable opinion. That's
>> the essence of leftism - weenism.
>
>I *know*. You won't listen. *You* are the essence of weenieism.
>
>> > But it's no secret that the power consumption for digital logic is
>> > dominated by energy-per-transition rather than quiescent current.
>>
>> That's true for 3.3V, it depends for 1.5V, and it is pretty much not
>> true for below one volt high speed logic. Consider the fanout and the
>> stray capacitance as well.
>
>Absolute horseshit.
>
Don't you just love all our resident "experts"?
But I notice our OP is "hotmail", so I would have never noticed,
except for you feeding the troll ;-)
...Jim Thompson
It's what you learn, after you know it all, that counts.
Expert == "Has-been drip under pressure"
Vlad has never been.
> But I notice our OP is "hotmail", so I would have never noticed,
> except for you feeding the troll ;-)
Prince Vlad? Troll? No!?
>
> One of the things that the IA64 got right in principle and wrong in
> practice was to try to simplify that area by making it more explicit.
> I still think that could be done - but not that way!
>
Maybe not. Maybe, as someone said to me that he had observed very
early in the game, the compiler is just too far from the action.
Zillions of transistors committed to OoO, branch prediction,
speculative execution, etc. are plenty close to the action, but, now
we have to worry about the fact that they eat power.
It was the failure of Dynamo-RIO and of Transmeta that puzzled me.
Why is this so fundamental? Why is it either Terje (or equivalent),
zillions of transistors burning watts, or live with it?
Surely it must be possible to have *something* scope what's actually
happening and respond appropriately... or is it that the computational
task is roughly the same as building a machine to pass the Turing
test?
Robert.
That's why I said what I said. They forgot that the architecture
is a protocol to be used by the compiler to communicate the semantics
of the program to the hardware, and not a set of laws for the compiler
to fit itself into.
I have posted in the past what I think of better approaches, and
most of the hardware people seem to agree that they would be easy
to implement. They wouldn't be hard to compile, either - from the
right sort of language (e.g. including Haskell, the better class of
Fortran program, but definitely not C and C++)! Whether they are
as effective as I think they might be is less clear.
Unrealistic? Perhaps. But what if you could get 10 times the
performance for 1/4 the power consumption? Wouldn't that be worth
a revolution?
Regards,
Nick Maclaren.
That's the essence of any disciple, leftist or "right"-ist. "The Poobah
says it, I believe it, that settles it!"
Unfortunately, they then vote for the poobah of their choice, tweedledumb
or tweedleduh.
Sigh.
Rich
>> But it's no secret that the power consumption for digital logic is
>> dominated by energy-per-transition rather than quiescent current.
>
> That's true for 3.3V, it depends for 1.5V, and it is pretty much not
> true for below one volt high speed logic. Consider the fanout and the
> stray capacitance as well.
How does stray capacitance increase the current drawn by a stable circuit?
If anything, it's going to increase the energy required for transitions
(I=C.dV/dt, increase C => increase I => increase I^2.R).
Vlad the Impaler doesn't have a clue, so he spouts nonsense.
Nobody wrote:
Returning to the speculation about the power consumption of the
cache: the cache accesses all cache lines (and the tags, of course) on
every read or write operation, so it is not an idle circuit.
dynamic losses = ~ F x C x U^2/2
Cache occupies large area, many transistors, long wires, many inputs and
outputs, big capacitance, big transistors to drive heavy loads, high
dynamic losses and high static losses as well. To me, it is not obvious
how the power consumption of the cache compares to the other parts of
the CPU.
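To put rough numbers on the formula above, here is a minimal Python sketch.
The `activity` factor (the fraction of the capacitance that actually toggles
each cycle) and every numeric value are assumptions chosen for illustration,
not data for any real cache or CPU. It only shows that the same
F x C x U^2/2 expression gives very different answers depending on how much
of the structure switches per access, which is the point in dispute.

```python
def dynamic_power_watts(freq_hz, capacitance_f, voltage_v, activity=1.0):
    """Approximate switching power using the F x C x U^2 / 2 form quoted above.

    `activity` (fraction of the capacitance that toggles per cycle) is an
    added assumption; it is not part of the formula in the post.
    """
    return activity * freq_hz * capacitance_f * voltage_v ** 2 / 2.0


# Hypothetical, illustrative values only; not measurements of any real chip.
FREQ_HZ = 2.0e9      # 2 GHz clock
CAP_F = 1.0e-9       # 1 nF of total switchable capacitance
VOLTAGE_V = 1.0      # 1 V supply

all_toggling = dynamic_power_watts(FREQ_HZ, CAP_F, VOLTAGE_V, activity=1.0)
few_toggling = dynamic_power_watts(FREQ_HZ, CAP_F, VOLTAGE_V, activity=0.01)

print(f"every node toggling each cycle: {all_toggling:.3f} W")
print(f"1% of nodes toggling each cycle: {few_toggling:.3f} W")
```

Under these assumed values the result scales linearly with `activity` and
with the square of the supply voltage. Static (leakage) power is not modelled
here at all, so this says nothing about which term dominates in a real design.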
And what percentage of software in use today is written in a language
other than C or C++ (or a language written on top of one of those)? How
many professional programmers (i.e. not academics) are learning and
using those languages? You'd have to have something amazingly
revolutionary to throw away the collective knowledge of an entire industry.
> Whether they are as effective as I think they might be is less clear.
>
> Unrealistic? Perhaps. But what if you could get 10 times the
> performance for 1/4 the power consumption? Wouldn't that be worth
> a revolution?
Perhaps -- and there _are_ other architectures that have been successful
in the embedded space, though none that provide the kind of benefits you
propose. However, for the desktop/laptop/server market, your chip would
have to be able to emulate a Wintel system and run existing software at
least as fast as the best x86 chip, and aside from the Alpha's few
months in the sun, nobody has managed to do that, and so we're all stuck
with x86 -- and every year that condition persists, we lock ourselves in
even more.
Worse, I'm not even sure that "10 times the performance for 1/4 the
power consumption" is enough to motivate most people to switch; that's
only a few years of Moore's Law -- probably less time than it'd take the
industry to learn your new system, buy and deploy the machines, etc. Do
you have a roadmap to how you'd continue to improve the performance of
your chips after the first release? I remember that Itanic was pretty
good when it was first designed, and a lot of companies considered
switching, but by the time they had geared up to do so, x86 had pulled
ahead again and Itanic was sinking in the same place it had been for two
years... PPC had a better run, but it still lost in the end because it
couldn't keep up with the relentless pace of improvements in x86 chips.
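As a rough check on "a few years of Moore's Law", the sketch below computes
how long a 10x gain takes if performance doubles at a fixed interval. The
18-month and 24-month doubling periods are commonly quoted figures assumed
here, not numbers from this thread, and real performance has not tracked
transistor counts exactly.

```python
import math


def years_to_gain(factor, doubling_period_years):
    """Years needed to reach `factor`x if the metric doubles every period."""
    return math.log2(factor) * doubling_period_years


# Doubling periods are assumptions (commonly quoted figures), not from the post.
for period in (1.5, 2.0):
    years = years_to_gain(10, period)
    print(f"10x at one doubling per {period} years: about {years:.1f} years")
```

A 10x gain is a little over three doublings, so the answer is simply about
3.3 times whatever doubling period one assumes.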
S
>
>
>Nobody wrote:
>
>> On Thu, 12 Mar 2009 14:45:38 -0500, Vladimir Vassilevsky wrote:
>>
>>
>>>>But it's no secret that the power consumption for digital logic is
>>>>dominated by energy-per-transition rather than quiescent current.
>>>
>>>That's true for 3.3V, it depends for 1.5V, and it is pretty much not
>>>true for below one volt high speed logic. Consider the fanout and the
>>>stray capacitance as well.
>>
>> How does stray capacitance increase the current drawn by a stable circuit?
>> If anything, it's going to increase the energy required for transitions
>> (I=C.dV/dt, increase C => increase I => increase I^2.R).
>
>Returning back to the speculations on the power consumption of the
>cache. Cache performs the access to all cache lines at every read or
>write operation (and the tags, of course), so it is not the idle circuit.
I hope you aren't an engineer. So far you're doing as well as
DimBulb.
>dynamic losses = ~ F x C x U^2/2
>
>Cache occupies large area, many transistors, long wires, many inputs and
>outputs, big capacitance, big transistors to drive heavy loads, high
>dynamic losses and high static losses as well. To me, it is not obvious
>how the power consumption of the cache compares to the other parts of
>the CPU.
Of course it's not obvious to you. You're dumb as a stump.