If I got this right..
XP thinks it's got 2 CPU's.
One thread will only use 50% of CPU mips.
So when I run spice, the CPU is only 1/2 working.
The other half is for whatever...email,browser,OS stuff etc.
I want spice to run at full speed.
Is there a way to set XP such that only my spice app gets nearly full
CPU attention?
When the spice app is not running, the computer goes back to 50 50
hyperthreading.
Note: I do have the option in bios to disengage hyperthreading.
I might do that if hyperthreading is being a pita.
D from BC
myrealaddress(at)comic(dot)com
British Columbia
Canada
Hello,
You can get what you asked for if you use the latest LTspiceIV.
It will stress all your CPU-cores to their limits. Let's see when the first
people will claim LTspice has burned their Quadcore-CPUs because
of running long lasting simulation runs. :-)
None of the other SPICE programs you can afford will use more
than one CPU-core.
Best regards,
Helmut
Great!
I checked my LTspice version ..it's version 2.25d :( Ugh..
I'll get version 4.
Thanks
I wonder if the other SPICE program writers are massively pissed about
LTspice being free.
Perhaps the LT president and LTspice writer get all 4 tires spiked a
little too often :P
But I don't think it's that gangsterlike in technology.. :)
I'm running LTspice4 now. Confirmed using help/about.
However, in Windows Task Manager scad3.exe is running.. huh...
Anyways..
My spice is not running faster and Windows Task Manager is only
showing CPU usage at 50%. iow...1 virtual core saturated.
I did spot the max threads option under the LTspice control panel.
It's set to 2.
I haven't noticed a change in speed. :(
Not any more than the original free spice. I just downloaded the
source, compiled and spiced.
> I have an old single core puter using hyperthreading with XP.
>
> If I got this right..
> XP thinks it's got 2 CPU's.
> One thread will only use 50% of CPU mips.
> So when I run spice, the CPU is only 1/2 working.
> The other half is for whatever...email,browser,OS stuff etc.
With hyperthreading, a single thread will use the entire CPU in the
absence of any competing threads (Windows reports that it's only 50%, but
it's lying; you can confirm this by running two processes on an
otherwise-idle system and noting that one process runs more slowly as the
other process consumes more CPU).
If you have a CPU with two full cores (e.g. Pentium D), the only way that you
can utilise both cores is with multi-threaded code. You can't make the CPU
behave as a single, faster core.
>On Tue, 20 Jan 2009 21:13:44 +0100, "Helmut Sennewald"
><helmuts...@t-online.de> wrote:
>
[snip]
>>
>>You can get what you asked for if you use the latest LTspiceIV.
>>It will stress all your CPU-cores to its limits. Let's see when the first
>>people will claim LTspice has burned their Quadcore-CPUs because
>>of running long lasting simulation runs. :-)
>>
>>None of the other SPICE programs you can afford will use more
>>than one CPU-core.
>>
>>Best regards,
>>Helmut
>>
>
>I wonder if the other SPICE program writers are massively pissed about
>LTspice being free.
No. Engineers don't choose simulators, jerks in management do...
that's why Cadence sales are so high, even though, IMNSHO, it's a
royal piece-a-crap, grossly user unfriendly and butt slow.
[snip]
...Jim Thompson
Hello,
I just tried a SMPS simulation.
LTspiceIV has caused a CPU usage of 100% on my dual core CPU on Win-XP.
The gain regarding simulation speed depends on the circuit size.
Small circuits will not gain much in speed. Large circuits will
run up to 3 times faster on a quad core CPU compared to a
single core CPU with the same clock frequency.
Best regards,
Helmut
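As a rough sanity check on the "up to 3 times faster on a quad core" figure above, one can ask what fraction of the work would have to run in parallel to reach it. The sketch below assumes Amdahl's law applies, which is a textbook simplification and not something claimed in the post; the function names are made up for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def parallel_fraction_for(speedup, cores):
    """Invert Amdahl's law: fraction of work that must be parallel."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


# Figures from the post: 3x speedup on a quad core CPU.
p = parallel_fraction_for(speedup=3.0, cores=4)
print(f"parallel fraction ~ {p:.3f}")
print(f"check: speedup on 4 cores = {amdahl_speedup(p, 4):.2f}")
```

Under that assumption, roughly 8/9 of the simulation work would need to be parallelisable, which fits the remark that small circuits gain little.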
That's because with hyperthreaded CPUs, "50% utilization" of "one CPU" is
still using (nearly) 100% of the total clock cycles available if nothing else
is running on the "second" CPU.
In other words... you get 1 billion operations per second (or whatever).
Hyperthreaded CPUs just give the appearance of two CPUs so that if a
particular thread is waiting on, e.g., a memory read from DRAM (this can take
hundreds of cycles), another thread can run in those hundreds of cycles.
However, if the thread is never waiting on anything, it will consume all 1
billion cycles every second.
A 1GHz hyperthreaded CPU can compute pi no faster than a non-hyperthreaded
CPU. (Nor will it be any slower.) But hyperthreading is a win when you have
multiple *different* threads running around when one starts having to wait and
would otherwise just be wasting CPU cycles...
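The "50%" reading can be reproduced with a very simplified model of how a monitor like Task Manager averages load over logical CPUs. This is only a sketch of the arithmetic, not how Windows actually computes the number, and the function name is hypothetical.

```python
def reported_cpu_percent(busy_logical_cpus, total_logical_cpus):
    """Percentage shown by a monitor that averages over logical CPUs."""
    return 100.0 * busy_logical_cpus / total_logical_cpus


# One saturated thread on a single-core hyperthreaded CPU (2 logical CPUs).
print(reported_cpu_percent(1, 2))
```

The model says nothing about how much of the physical core the single thread is really using, which is the point made above.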
Oh, I think they're a little more ticked than before -- LTspice has many, many
improvements over Berkeley SPICE, plus it has a decent (if simplistic)
integrated graphing/probing environment, that had to be provided by various
(not as tightly-integrated) add-ons with the Berkeley SPICE.
And amazingly, sometimes those jerks haven't even used SPICE in years (since,
e.g., college)... and possibly not ever!
>On Tue, 20 Jan 2009 11:44:07 -0800, D from BC wrote:
>
>> I have an old single core puter using hyperthreading with XP.
>>
>> If I got this right..
>> XP thinks it's got 2 CPU's.
>> One thread will only use 50% of CPU mips.
>> So when I run spice, the CPU is only 1/2 working.
>> The other half is for whatever...email,browser,OS stuff etc.
>
>With hyperthreading, a single thread will use the entire CPU in the
>absence of any competing threads (Windows reports that it's only 50%, but
>it's lying; you can confirm this by running two processes on an
>otherwise-idle system and noting that one process runs more slowly as the
>other process consumes more CPU).
So maybe I have LTspice running at nearly full blast with my single
core yet 2 virtual core ye old hyperthreading Pentium 4 Prescott.
>
>If you have a CPU with two full cores (Pentium-M), the only way that you
>can utilise both cores is with multi-threaded code. You can't make the CPU
>behave as a single, faster core.
"D from BC" <myreal...@comic.com> wrote in message
news:paicn4hriphp1sgi3...@4ax.com...
Try to disable Hyperthreading in the BIOS. In rare cases it will
actually slow things down, like in SQL server
Cheers
After reading replies, I did this experiment..
Ran the simulation and looked at the simulation run speed at the
bottom of the LTspice window.
LTspice reports simulation speed ~27us/s
Windows shows a total CPU usage at 50%
Spawned another instance of LTspice and ran the same simulation at the
same time..
LTspice reports simulation speed ~15us/s
Windows shows 100% CPU usage..
2 threads, one core, so the hyperthreading splits up the work.
1 thread one core, the hyperthreading lets the thread use up all the
mips as long as there's no competing threads.
Thanks for replies...That clears that up.
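The two readings from the experiment give a rough idea of what hyperthreading buys on this machine. A minimal sketch, using only the reported speeds (~27us/s for one instance, ~15us/s each for two); the function name is made up for illustration.

```python
def smt_throughput_gain(single_rate, per_instance_rate, instances=2):
    """Combined throughput of several instances relative to one running alone."""
    combined = per_instance_rate * instances
    return combined / single_rate


# Reported speeds, in simulated microseconds per wall-clock second.
gain = smt_throughput_gain(single_rate=27.0, per_instance_rate=15.0)
print(f"combined throughput is {gain:.2f}x a single instance")
```

So two instances together get through only about 11% more work than one alone, nowhere near the 2x that the "50%" reading might suggest.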
As the kids say: THIS.
So much THIS.
None of my other *programs* use more than one CPU core (that I have
noticed).
And that's using Wine on linux! It was the first time I ever saw
"400%" CPU usage. Good job Mike.
--
John Devereux
>
>I wonder if the other SPICE program writers are massively pissed about
>LTspice being free.
>
I don't think so. Ltspice is not very user friendly IMHO.
Compared to entering netlists by hand it's quite friendly. :-)
>D from BC <myreal...@comic.com> wrote:
>
>>
>>I wonder if the other SPICE program writers are massively pissed about
>>LTspice being free.
>>
>
>I don't think so. Ltspice is not very user friendly IMHO.
Compared to what program?
'Stupid LTSpice program doesn't have a 'Read Mind Make Circuit'
button.' :P
I wish LTspice would accept the old original MicroSim Schematics as a
front-end.
LTspice schematic capture is rather juvenile ;-)
>On Wed, 21 Jan 2009 21:24:16 GMT, ni...@puntnl.niks (Nico Coesel)
>wrote:
>
>>D from BC <myreal...@comic.com> wrote:
>>
>>>
>>>I wonder if the other SPICE program writers are massively pissed about
>>>LTspice being free.
>>>
>>
>>I don't think so. Ltspice is not very user friendly IMHO.
>
>Compared to what program?
>
>'Stupid LTSpice program doesn't have a 'Read Mind Make Circuit'
>button.' :P
>
>
>D from BC
That would be nice, particularly if there was a "Correct Wrong
Connections" button ;-)
In the old days we used an IBM 029 to enter the netlists.
I used to draw schematics with pencil-on-paper, number the nodes, then
type up the netlist and Spice commands, then run on a VAX-780 ;-)
Well, the LTspice schematic format is well-known/documented... did MicroSim
ever publicly document the format of their schematic files? If so then it
shouldn't be too difficult to write a converter... if not finding someone to
voluntarily reverse-engineer the MicroSim schematic format is probably
unlikely...?
> LTspice schematic capture is rather juvenile ;-)
Yeah, it is, but you have to remember it's the work of exactly one guy who has
other duties besides just working on new features/bug fixes for it.
---Joel
I know... I was just "funning" ;-)
I can generate a netlist/cir file using my schematic capture then run
in LTspice, but I'd have to enter the node names by hand in the
post-processor.
I never had to use an 029, but I did have to wrap JCL around the
netlist to run the sim jobs on the mainframe. Text entry was via
2741s (ruggedized communicating Selectric typewriters) and output via
line/chain printers.
>On Wed, 21 Jan 2009 21:24:16 GMT, ni...@puntnl.niks (Nico Coesel)
>wrote:
>
>>D from BC <myreal...@comic.com> wrote:
>>
>>>
>>>I wonder if the other SPICE program writers are massively pissed about
>>>LTspice being free.
>>>
>>
>>I don't think so. Ltspice is not very user friendly IMHO.
>
>Compared to what program?
I like Orcad pspice. The old microsim pspice also had a good
schematics entry. If LTspice had a better user interface it would be a
killer.
Hello Nico,
I bet that I am faster drawing a schematic for simulation in LTspice than
you or I would be in PSPICE/ORCAD or any other SPICE program.
LTspice schematic entry is optimized for SPICE schematics.
PSPICE/ORCAD schematic entry is designed for PCB layout
requirements, which means extra burden if you only want to enter
a schematic for SPICE. I like the LTspice schematic entry,
because it's fast and dedicated for (LT-)SPICE.
Best regards,
Helmut
>D from BC <myreal...@comic.com> wrote:
>
>>On Wed, 21 Jan 2009 21:24:16 GMT, ni...@puntnl.niks (Nico Coesel)
>>wrote:
>>
>>>D from BC <myreal...@comic.com> wrote:
>>>
>>>>
>>>>I wonder if the other SPICE program writers are massively pissed about
>>>>LTspice being free.
>>>>
>>>
>>>I don't think so. Ltspice is not very user friendly IMHO.
>>
>>Compared to what program?
>
>I like Orcad pspice.
Capture sucks!
>The old microsim pspice also had a good
>schematics entry.
I still use it. Love it!
>If LTspice had a better user interface it would be a
>killer.
Yep!
Still wouldn't help Sloman. It would still display "INSUFFICIENT
DATA, TRY GAIN".
LTspice schematics entry is just a little unusual, when you come
from MicroSim Pspice. Now that I'm used to it, I do indeed find it
better. Much better. It's all a matter of personal taste, I suppose.
Jeroen Belleman
>"D from BC" <myreal...@comic.com> wrote in message
>news:paicn4hriphp1sgi3...@4ax.com...
>> I did spot the max treads option under the LTspice control panel.
>> It's set to 2.
>> I haven't noticed a change in speed. :(
>
>That's because with hyperthreaded CPUs, "50% utilization" of "one CPU" is
>still using (nearly) 100% of the total clock cycles available if nothing else
>is running on the "second" CPU.
>
>In other words... you get 1 billion operations per second (or whatever).
>Hyperthreaded CPUs just give the appearance of two CPUs so that if a
>particular thread is waiting on, e.g., a memory read from DRAM (this can take
>hundreds of cycles)
Memory access taking hundreds of cycles? Hell not even a dozen.
Task switches take hundreds of clocks assuming no cache misses.
>, another thread can run in those hundreds of cycles.
What kills many applications is disk I/O, by taking many tens of
thousands to a few millions of clock cycles waiting to complete.
Hyperthreading uses this time in multitasking situations to get more
done. There has to be something to be done in order to see any
improvement.
>However, if the thread is never waiting on anything, it will consume all 1
>billion cycles every second.
>
>A 1GHz hyperthreaded CPU can compute pi no faster than a non-hyperthreaded
>CPU. (Nor will it be any slow.) But hyperthreaded is a win when you have
>multiple *different* threads running around when one starts having to wait and
>would otherwise just be wasting CPU cycles...
>
Your guesses were so far off as to be laughable.
You must be thinking of 20 years ago.
When you say line/chain printers, are you specifically limiting that to chain
printers or are you including drum printers? (or even other
technologies)
I don't think anyone is really bitching about LTSpice schematic
capture much. It could easily be better, like for instance add a
library editor / component creator. Persuade several passive part
providers to create libraries (kind of dependent on component creator
function). I think asking for hierarchical design would be a bit much
though.
Hello,
LTspice has the feature of hierarchical schematics and you
can probe in the lower level schematic of course.
Best regards,
Helmut
Chain printers (1403s). They were soon replaced with laser printers
(3800s) though.
>>In other words... you get 1 billion operations per second (or whatever).
>>Hyperthreaded CPUs just give the appearance of two CPUs so that if a
>>particular thread is waiting on, e.g., a memory read from DRAM (this can take
>>hundreds of cycles)
>
> Memory access taking hundreds of cycles? Hell not even a dozen.
It depends how fast your RAM is. At one point (I guess around 5 years
ago), 350 CPU cycles for a code cache miss was not atypical, but RAM speed
has been consistently increasing faster than CPU speed for the last few
years.
> Task switches take hundreds of clocks assuming no cache misses.
>
>>, another thread can run in those hundreds of cycles.
>
> What kills many applications is disk I/O, by taking many tens of
> thousands to a few millions of clock cycles waiting to complete.
>
> Hyperthreading uses this time in multitasking situations to get more
> done. There has to be something to be done in order to see any
> improvement.
For a virtual memory cache miss, where you're waiting on disc access,
hyperthreading doesn't help. The OS kernel will just do a full task
switch (assuming that there's another process/thread to switch to).
It's possible to add components (to mine I added a bunch of common NXP
transistors, 7805, and TL471 using 3rd-party models).
LT is not going to make it easy for NS, TI or OnSemi to produce models
of their switchers and op-amps; LTspice is, after all, a marketing tool.
What is missing is a way to create new classes of component
(examples of classes are NPN, PNP, NCH, PCH etc... you pick the class and
then get a list of models).
>>>In other words... you get 1 billion operations per second (or whatever).
>>>Hyperthreaded CPUs just give the appearance of two CPUs so that if a
>>>particular thread is waiting on, e.g., a memory read from DRAM (this
>>>can take hundreds of cycles)
>>
>> Memory access taking hundreds of cycles? Hell not even a dozen.
>
> It depends how fast your RAM is. At one point (I guess around 5 years
> ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
> speed has been consistently increasing faster than CPU speed for the
> last few years.
To remove a possible source of confusion: cycle "costs" take into account
the fact that each core can execute multiple instructions concurrently
(superscalar architecture). So a cost of e.g. "100 cycles" refers to a
delay in which a sequence of instructions totalling 100 cycles could be
executed, not 100 times the CPU clock period.
>On Sat, 24 Jan 2009 02:40:02 -0800, JosephKK wrote:
>
>>>In other words... you get 1 billion operations per second (or whatever).
>>>Hyperthreaded CPUs just give the appearance of two CPUs so that if a
>>>particular thread is waiting on, e.g., a memory read from DRAM (this can take
>>>hundreds of cycles)
>>
>> Memory access taking hundreds of cycles? Hell not even a dozen.
>
>It depends how fast your RAM is. At one point (I guess around 5 years
>ago), 350 CPU cycles for a code cache miss was not atypical, but RAM speed
>has been consistently increasing faster than CPU speed for the last few
>years.
This is the first i have heard of it taking that long.
So you have heard of pipeline bubbling. The pipelines are not that
deep, about 7 stages max due to complexity increases.
Current and recent processors (about 5 years for x86, more for SPARC
and others) support speculative execution and out of order execution
to reduce this problem.
How about that. I never stumbled into how to add components. Please
enlighten me.
I haven't recently had much need for more than i could put on one
sheet. Guess that is why i hadn't found that feature. Of course i
only have LTSpice/SwitcherCad III v 2.21 from October 2007. I must
need to update or something.
>On Sun, 25 Jan 2009 00:06:57 +0000, Nobody <nob...@nowhere.com> wrote:
>
>>On Sat, 24 Jan 2009 19:39:32 +0000, Nobody wrote:
>>
>>>>>In other words... you get 1 billion operations per second (or whatever).
>>>>>Hyperthreaded CPUs just give the appearance of two CPUs so that if a
>>>>>particular thread is waiting on, e.g., a memory read from DRAM (this
>>>>>can take hundreds of cycles)
>>>>
>>>> Memory access taking hundreds of cycles? Hell not even a dozen.
>>>
>>> It depends how fast your RAM is. At one point (I guess around 5 years
>>> ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
>>> speed has been consistently increasing faster than CPU speed for the
>>> last few years.
>>
>>To remove a possible source of confusion: cycle "costs" take into account
>>the fact that each core can execute multiple instructions concurrently
>>(superscalar architecture). So a cost of e.g. "100 cycles" refers to a
>>delay in which a sequence of instructions totalling 100 cycles could be
>>executed, not 100 times the CPU clock period.
>
>So you have heard of pipeline bubbling. The pipelines are not that
>deep, about 7 stages max due to complexity increases.
Depends on the processor. The G5 and P4 were significantly deeper
than that (more like 20 stages). The entire pipe is flushed on a
mispredicted branch or context switch. If the target isn't in the
cache it has to be reloaded from main memory.
>Current and recent processors (about 5 years for x86, more for SPARC
>and others) support speculative execution and out of order execution
>to reduce this problem.
It doesn't reduce the problem, rather makes it occur less often (when
the planets line up). The "100 cycles" is still there. Memory with a
100ns access and a 1GHz CPU kinda makes access 100x clock.
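The "100x clock" figure is just latency multiplied by clock rate. A small sketch of that arithmetic; the 3 GHz case is a hypothetical extra for comparison and not a figure from the thread.

```python
def stall_cycles(latency_ns, clock_ghz):
    """Clock periods that elapse during a memory access of the given latency."""
    return latency_ns * clock_ghz


print(stall_cycles(100, 1.0))  # the 100ns / 1GHz case from the post
print(stall_cycles(100, 3.0))  # hypothetical faster clock, same latency
```

For a fixed latency in nanoseconds, a faster clock makes the cost in cycles larger, not smaller.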
>>> It depends how fast your RAM is. At one point (I guess around 5 years
>>> ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
>>> speed has been consistently increasing faster than CPU speed for the
>>> last few years.
>>
>>To remove a possible source of confusion: cycle "costs" take into account
>>the fact that each core can execute multiple instructions concurrently
>>(superscalar architecture). So a cost of e.g. "100 cycles" refers to a
>>delay in which a sequence of instructions totalling 100 cycles could be
>>executed, not 100 times the CPU clock period.
>
> So you have heard of pipeline bubbling. The pipelines are not that
> deep, about 7 stages max due to complexity increases.
> Current and recent processors (about 5 years for x86, more for SPARC
> and others) support speculative execution and out of order execution
> to reduce this problem.
Indeed. But while that mitigates data cache misses, it doesn't do anything
for a code cache miss.
Especially on x86, where instructions are variable length and so can't
even be decoded if prior instructions are missing, let alone executed.
Even on a RISC architecture, it makes such techniques much less efficient.
If the missing instructions include a branch which will usually be taken,
speculatively executing subsequent instructions is a waste of cycles.
Similarly for an instruction which uses the contents of a register which
is modified by a missing prior instruction.
If anything, it is the reliance upon caching and instruction re-ordering
which makes code cache misses such a performance killer, as they mitigated
the problems with slow RAM to such an extent that there was little
incentive to increase RAM speed.
Although such techniques worked well for "classical" procedural code, they
often worked rather less well for e.g. object-oriented code making heavy
use of virtual functions, or interpreted languages where a substantial
portion of the interpreter can be required for even the simplest functions.
Not so on mispredicted branches. Moreover speculative execution of
both sides almost eliminates the issue. Also that may have been that
much total depth but less than 3% of instructions (and much less than
1 % of execution) need all of them, mostly things like pusha and popa
which move multiple registers onto and off of the stack.
>
>>Current and recent processors (about 5 years for x86, more for SPARC
>>and others) support speculative execution and out of order execution
>>to reduce this problem.
>
>It doesn't reduce the problem, rather makes it occur less often (when
>the planets line up). The "100 cycles" is still there. Memory with a
>100ns access and a 1GHz CPU kinda makes access 100x clock.
Wow, the last time i saw ram with 100 ns access times was back in the
386 days. Even then you could get 70 ns and 60 ns premium parts.
Current stuff is like 12 ns to 15 ns access and 60 ns to 85 ns cycle
times with multiple consecutive address available at 5 ns intervals.
The only place you get killed is on cache writeback block outs, that
does have 100 ns plus lags before reading the new data (but that does
not apply to instruction caches).
>On Sun, 25 Jan 2009 10:59:31 -0800, JosephKK wrote:
>
>>>> It depends how fast your RAM is. At one point (I guess around 5 years
>>>> ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
>>>> speed has been consistently increasing faster than CPU speed for the
>>>> last few years.
>>>
>>>To remove a possible source of confusion: cycle "costs" take into account
>>>the fact that each core can execute multiple instructions concurrently
>>>(superscalar architecture). So a cost of e.g. "100 cycles" refers to a
>>>delay in which a sequence of instructions totalling 100 cycles could be
>>>executed, not 100 times the CPU clock period.
>>
>> So you have heard of pipeline bubbling. The pipelines are not that
>> deep, about 7 stages max due to complexity increases.
>> Current and recent processors (about 5 years for x86, more for SPARC
>> and others) support speculative execution and out of order execution
>> to reduce this problem.
>
>Indeed. But while that mitigates data cache misses, it doesn't do anything
>for a code cache miss.
That does not seem to follow.
>
>Especially on x86, where instructions are variable length and so can't
>even be decoded if prior instructions are missing, let alone executed.
Kindly explain how you get past the previous instruction to begin
decoding the current instruction without decoding the previous
instruction.
>
>Even on a RISC architecture, it makes such techniques much less efficient.
>If the missing instructions include a branch which will usually be taken,
>speculatively executing subsequent instructions is a waste of cycles.
What? If you speculatively execute both results of a conditional branch
the pipeline stays full and you just drop the unused results in the
bit bucket. The result: conditional branches without significant
pipeline bubbles.
>Similarly for an instruction which uses the contents of a register which
>is modified by a missing prior instruction.
Ah, you misunderstand the methods and practice of out of order
execution. See the Tomasulo algorithm, and register scoreboarding
(which requires "phantom registers").
>
>If anything, it is the reliance upon caching and instruction re-ordering
>which makes code cache misses such a performance killer, as they mitigated
>the problems with slow RAM to such an extent that there was little
>incentive to increase RAM speed.
That is non-factual and barely coherent.
>
>Although such techniques worked well for "classical" procedural code, they
>often worked rather less well for e.g. object-oriented code making heavy
>use of virtual functions, or interpreted languages where a substantial
>portion of the interpreter can be required for even the simplest functions.
In terms of good instruction cache locality, interpreted languages win
hugely. They also tend to win on data locality as well. Think about
how they really work. I haven't seen many interpreted OO languages
for some reason, maybe there the virtual functions do cause problems.
If the branch target misses the cache and a new DRAM page has to be
opened, yes it does. Branches don't do PUSHA/POPA. Memory access
is still 100x CPU clock.
> >>Current and recent processors (about 5 years for x86, more for SPARC
> >>and others) support speculative execution and out of order execution
> >>to reduce this problem.
> >
> >It doesn't reduce the problem, rather makes it occur less often (when
> >the planets line up). The "100 cycles" is still there. Memory with a
> >100ns access and a 1GHz CPU kinda makes access 100x clock.
>
> Wow, the last time i saw ram with 100 ns access times was back in the
> 386 days. Even then you could get 70 ns and 60 ns premium parts.
> Current stuff is like 12 ns to 15 ns access and 60 ns to 85 ns cycle
> times with multiple consecutive address available at 5 ns intervals.
Try measuring apples to apples (access a closed page). You'll find
that shiny new memory isn't all that much faster than that of
twenty years ago. Current processors aren't 1GHz, either. The
ratio is still ~100:1.
> The only place you get killed is on cache writeback block outs, that
> does have 100 ns plus lags before reading the new data (but that does
> not apply to instruction caches).
Huh? I'm missing your point here. Cache castouts aren't in the
performance path.
A code cache miss only occurs on a mispredicted branch. Branch
prediction becomes the key.
> Especially on x86, where instructions are variable length and so can't
> even be decoded if prior instructions are missing, let alone executed.
Variable instruction length doesn't come into it. One instruction
can't be fetched until all prior instructions have been fetched.
> Even on a RISC architecture, it makes such techniques much less efficient.
> If the missing instructions include a branch which will usually be taken,
> speculatively executing subsequent instructions is a waste of cycles.
If it's a branch usually taken, those instructions *will* be in
cache and speculatively executed. Only branches usually not-taken
cause pipeline flushes. That's the whole point of tracking
not/taken.
> Similarly for an instruction which uses the contents of a register which
> is modified by a missing prior instruction.
How can you have a missing prior instruction? How did you get here
without the prior instruction already having been fetched?
Instruction fetch is an in-order operation.
> If anything, it is the reliance upon caching and instruction re-ordering
> which makes code cache misses such a performance killer, as they mitigated
> the problems with slow RAM to such an extent that there was little
> incentive to increase RAM speed.
Increasing RAM speed is *expensive*. Size is far more important
because missing RAM is far worse. How many processor cycles does
it take to swap pages?
> Although such techniques worked well for "classical" procedural code, they
> often worked rather less well for e.g. object-oriented code making heavy
> use of virtual functions, or interpreted languages where a substantial
> portion of the interpreter can be required for even the simplest functions.
They still work, though obviously the working set is larger for an
interpreted language.
>>>>> It depends how fast your RAM is. At one point (I guess around 5 years
>>>>> ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
>>>>> speed has been consistently increasing faster than CPU speed for the
>>>>> last few years.
>>>>
>>>>To remove a possible source of confusion: cycle "costs" take into account
>>>>the fact that each core can execute multiple instructions concurrently
>>>>(superscalar architecture). So a cost of e.g. "100 cycles" refers to a
>>>>delay in which a sequence of instructions totalling 100 cycles could be
>>>>executed, not 100 times the CPU clock period.
>>>
>>> So you have heard of pipeline bubbling. The pipelines are not that
>>> deep, about 7 stages max due to complexity increases.
>>> Current and recent processors (about 5 years for x86, more for SPARC
>>> and others) support speculative execution and out of order execution
>>> to reduce this problem.
>>
>>Indeed. But while that mitigates data cache misses, it doesn't do anything
>>for a code cache miss.
>
> That does not seem to follow.
>
>>
>>Especially on x86, where instructions are variable length and so can't
>>even be decoded if prior instructions are missing, let alone executed.
>
> Kindly explain how you get past the previous instruction to begin
> decoding the current instruction without decoding the previous
> instruction.
Well, that's my point.
If the CPU tries to execute the next instruction, but the data isn't
immediately available, it can just move on to the following instructions.
OTOH, if it tries to execute the next instruction, but the *instruction*
isn't available, it's stuck. It can't do anything until the instruction is
available.
If the CPU can normally execute several instructions per clock (regardless
of whether their operands are immediately available), a code cache miss
means that it has to wait for the existing transfer to complete, then
start a new transfer and wait for that to start producing instructions,
meaning that it's going to end up many instructions behind where it would
have been compared to a cache hit.
>>Even on a RISC architecture, it makes such techniques much less efficient.
>>If the missing instructions include a branch which will usually be taken,
>>speculatively executing subsequent instructions is a waste of cycles.
>
> What? If you speculative execute both results of a conditional branch
> the pipeline stays full and you just drop the unused results in the
> bit bucket. The result: conditional branches without significant
> pipeline bubbles.
Taking both branches is inefficient if one is far more likely than the
other; that's the purpose of branch prediction. E.g. with a loop, the
branch which executes the next iteration is usually far more common than
the branch which exits the loop, so you're better off just speculatively
executing the next iteration than hedging your bets.
>>Similarly for an instruction which uses the contents of a register which
>>is modified by a missing prior instruction.
>
> Ah, you misunderstand the methods and practice of out of order
> execution. See the Tomasulo algorithm, and register scoreboarding
> (which requires "phantom registers").
Only works if you know what the previous instruction is.
If you have:
    mov r3,[something]
    add r0,r1,r2
but [something] isn't available, you can still commence the add
immediately.
OTOH, if you have:
    mov r2,[something]
    add r0,r1,r2
the addition can be commenced but can't proceed until r2 holds a concrete
value.
And if you have:
    <missing instruction>
    add r0,r1,r2
you can't even commence the add with an abstract value for r2, because you
have no idea where the value will eventually come from. r2 may even hold a
concrete value, but you don't know that until you've seen the missing
instruction.
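The three cases above can be sketched with a toy model. This is only an illustration under stated assumptions: the tuple layout, the classify function and the label strings are made up here, and a real out-of-order core tracks far more state than one set of pending registers.

```python
# Toy model of an instruction window (hypothetical, not a real CPU model).
# Each entry is (dest_register, source_registers, result_is_delayed),
# or None for an instruction whose bytes have not arrived yet.

def classify(window):
    """Label each instruction, in program order, as ready/waiting/missing/unknown."""
    pending = set()        # registers whose values are not available yet
    seen_missing = False   # set once an undecodable instruction has been passed
    labels = []
    for insn in window:
        if insn is None:
            seen_missing = True
            labels.append("missing")
            continue
        dest, sources, delayed = insn
        if seen_missing:
            # The missing instruction might write any register, so the
            # dependencies of everything after it cannot be resolved.
            labels.append("unknown")
        elif pending.intersection(sources):
            # A source register is still being produced: commence, then wait.
            labels.append("waiting")
            pending.add(dest)
        elif delayed:
            # Can start now, but its result (e.g. a load that missed) arrives late.
            labels.append("ready")
            pending.add(dest)
        else:
            labels.append("ready")
            pending.discard(dest)
    return labels


load_r3 = ("r3", (), True)          # mov r3,[something] with a data cache miss
load_r2 = ("r2", (), True)          # mov r2,[something] with a data cache miss
add = ("r0", ("r1", "r2"), False)   # add r0,r1,r2

print(classify([load_r3, add]))   # ['ready', 'ready']
print(classify([load_r2, add]))   # ['ready', 'waiting']
print(classify([None, add]))      # ['missing', 'unknown']
```

The last case is the one under dispute: with the earlier instruction undecoded, the model cannot say whether the add is ready or waiting.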
Out-of-order and speculative execution avoid CPU stalls due to data
cache misses, but they either don't work or are significantly less
efficient for a code cache miss.
>>If anything, it is the reliance upon caching and instruction re-ordering
>>which makes code cache misses such a performance killer, as they mitigated
>>the problems with slow RAM to such an extent that there was little
>>incentive to increase RAM speed.
>
> That is non-factual and barely coherent.
Which part of it is unclear?
>>Although such techniques worked well for "classical" procedural code, they
>>often worked rather less well for e.g. object-oriented code making heavy
>>use of virtual functions, or interpreted languages where a substantial
>>portion of the interpreter can be required for even the simplest functions.
>
> In terms of good instruction cache locality, interpreted languages win
> hugely.
That depends heavily upon the language complexity. For a simple
language like BASIC, with few types and few primitive operations, you'll
probably get good locality.
For a complex language with many variants on the basic types, something as
simple as adding a list of numbers can end up calling a dozen different
addition functions (int+float, int+double, arbitrary-precision-int+double,
...), and it may have to go through several steps just to determine the
correct function for each value (e.g. Python will first check whether the
LHS has an __add__ method, if not then whether the RHS has an __radd__
(reverse add) method, then if either side has type-cast methods, ...).
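A small sketch of that dispatch, for illustration only. The Metres class is hypothetical; the mechanism shown is the standard one, where the left operand's __add__ is tried first and, if it returns NotImplemented, the right operand's __radd__ is tried.

```python
class Metres:
    """Hypothetical value type used only to show how '+' is dispatched."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Called for Metres + something.
        if isinstance(other, Metres):
            return Metres(self.value + other.value)
        return NotImplemented   # tell the interpreter to try the other operand

    def __radd__(self, other):
        # Called for something + Metres, after something.__add__ gave up.
        if isinstance(other, (int, float)):
            return Metres(other + self.value)
        return NotImplemented


a = Metres(1) + Metres(2)   # handled by Metres.__add__
b = 2 + Metres(3)           # int.__add__ returns NotImplemented, so __radd__ runs
print(a.value, b.value)     # 3 5

try:
    Metres(1) + 2           # neither side accepts this combination
except TypeError as exc:
    print("TypeError:", exc)
```

Each "+" in the source therefore costs at least one lookup and call, and sometimes two, before any arithmetic happens.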
> They also tend to win on data locality as well.
Interpreted languages are more likely to have values dynamically allocated
and referenced through pointers. E.g. if the value is a 3-tuple, you get a
pointer to the tuple which holds 3 pointers to the individual values,
which themselves may contain additional levels of indirection, and the
various pointers are pointing all over the heap.
Compared to C, where most values are either stored at a small offset from
the frame pointer, or are one level of indirection away (i.e. a struct
pointer stored at a small offset from the frame pointer).
> Think about
> how they really work. I haven't seen many interpreted OO languages
> for some reason, maybe there the virtual functions do cause problems.
JavaScript, Python and C# are all interpreted languages with a strong OO
bias.
I can't speak to implementation issues for JavaScript or C#, but Python
is extremely dynamic.
Everything is an object. Retrieving a member value from an object involves
first checking whether the object has __getattr__ or __getattribute__
methods; if it does, the method is called with the name of the field to
retrieve the value. This is also done for methods, which are just members
which happen to be functions.
None of this can be done a priori due to dynamic typing. Functions don't
require their argument values to belong to a specific class, just that
they contain the members which the function uses. E.g. a function which
expects a file argument might only care that the object has a read()
method (which has to be retrieved by name each time).
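A minimal sketch of that kind of duck typing. The names first_word, FakeFile and Lazy are made up for the example; io.StringIO is the standard library's in-memory text stream.

```python
import io


def first_word(source):
    # Only requires that `source` has a read() method returning a string;
    # the attribute is looked up by name each time this runs.
    return source.read().split()[0]


class FakeFile:
    """Unrelated to any file class, but has a read() method."""

    def read(self):
        return "hello world"


class Lazy:
    """Has no read() at all; __getattr__ supplies one when lookup fails."""

    def __getattr__(self, name):
        if name == "read":
            return lambda: "made up on demand"
        raise AttributeError(name)


print(first_word(io.StringIO("spice run")))   # spice
print(first_word(FakeFile()))                 # hello
print(first_word(Lazy()))                     # made
```

The same function body works for all three objects, which is why the lookup cannot be resolved ahead of time.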
From what I know of JavaScript, it isn't much different. Its primitive
operations are more primitive, but being prototype-based rather than
class-based means that you still can't optimise based upon the expected
type of an object, as the code has to work with any object providing the
correct interface, with no knowledge of its underlying implementation.
I am amazed at how badly you misread this.
Here is a typical manufacturer's website discussing DDR2 memory.
Please note the availability of rather comprehensive timing diagrams for
speeds up to DDR2-800. If you understand this manufacturer's
literature correctly, the total read latency is on the order of 20 ns
(worst case). For the processor to be 100 x faster, cycle time would
have to be less than 200 ps, which would be equivalent to about a 5
GHz clock. Current parts are about 1.5 to 3 GHz clocks. Best
possible speed ratio in the CPU's favor: 60 to 1.
>
>> >>Current and recent processors (about 5 years for x86, more for SPARC
>> >>and others) support speculative execution and out of order execution
>> >>to reduce this problem.
>> >
>> >It doesn't reduce the problem, rather makes it occur less often (when
>> >the planets line up). The "100 cycles" is still there. Memory with a
>> >100ns access and a 1GHz CPU kinda makes access 100x clock.
>>
>> Wow, the last time i saw ram with 100 ns access times was back in the
>> 386 days. Even then you could get 70 ns and 60 ns premium parts.
>> Current stuff is like 12 ns to 15 ns access and 60 ns to 85 ns cycle
>> times with multiple consecutive address available at 5 ns intervals.
>
>Try measuring apples to apples (access a closed page). You'll find
>that shiny new memory isn't all that much faster than that of
>twenty years ago. Current processors aren't 1GHz, either. The
>ratio is still ~100:1.
What do you mean by a closed page?
And current memory is not 60 ns any more either, average effective
access when used correctly is more like 6 ns.
>
>> The only place you get killed is on cache writeback block outs, that
>> does have 100 ns plus lags before reading the new data (but that does
>> not apply to instruction caches).
>
>Huh? I'm missing your point here. Cache castouts aren't in the
>performance path.
>
The issue is dirty cache page write back (data segments) in order to
load a new page.
For modern devices and even several older ones (pentium, SPARC,
PA-RISC, i860, i960, S/360 and S/370, DEC VAX, DEC Alpha, and more)
you do not have a point. Out of order execution requires decoding of
each instruction to determine data dependencies. If the next
instruction(s or more) does not have data dependencies, continuing
with execution does no harm. It will not work for a 6502 or an 8080.
>
>OTOH, if it tries to execute the next instruction, but the *instruction*
>isn't available, it's stuck. It can't do anything until the instruction is
>available.
>
>If the CPU can normally execute several instructions per clock (regardless
>of whether their operands are immediately available), a code cache miss
>means that it has to wait for the existing transfer to complete, then
>start a new transfer and wait for that to start producing instructions,
>meaning that it's going to end up many instructions behind where it would
>have been compared to a cache hit.
Do you think for some reason that there is only one block of instructions
mapped in the cache? After the first time through a loop both
sequences get mapped if they are not already.
>
>>>Even on a RISC architecture, it makes such techniques much less efficient.
>>>If the missing instructions include a branch which will usually be taken,
>>>speculatively executing subsequent instructions is a waste of cycles.
>>
>> What? If you speculative execute both results of a conditional branch
>> the pipeline stays full and you just drop the unused results in the
>> bit bucket. The result: conditional branches without significant
>> pipeline bubbles.
>
>Taking both branches is inefficient if one is far more likely than the
>other; that's the purpose of branch prediction. E.g. with a loop, the
>branch which executes the next iteration is usually far more common than
>the branch which exits the loop, so you're better off just speculatively
>executing the next iteration than hedging your bets.
Only partially true. More aggressive cache preloading will have both
sequences available for when the loop gets near termination.
>
>>>Similarly for an instruction which uses the contents of a register which
>>>is modified by a missing prior instruction.
>>
>> Ah, you misunderstand the methods and practice of out of order
>> execution. See the Tomasulo algorithm, and register scoreboarding
>> (which requires "phantom registers").
>
>Only works if you know what the previous instruction is.
>
>If you have:
>
> mov r3,[something]
> add r0,r1,r2
>
>but [something] isn't available, you can still commence the add
>immediately.
>
>OTOH, if you have:
>
> mov r2,[something]
> add r0,r1,r2
>
>the addition can be commenced but can't proceed until r2 holds a concrete
>value.
Correct as far as it goes. But what if the next instruction is div
r4,r5,r6; it can be executed because it has no unsatisfied data
dependencies. See how it works?
>
>And if you have:
>
> <missing instruction>
> add r0,r1,r2
>
>you can't even commence the add with an abstract value for r2, because you
>have no idea where the value will eventually come from. r2 may even hold a
>concrete value, but you don't know that until you've seen the missing
>instruction.
This case cannot occur except at boot time when there are exactly 0
previously executed instructions.
>
>Out-of-order and speculative execution avoid CPU stalls due to data
>cache misses, but they either don't work or are significantly less
>efficient for a code cache miss.
Oh contraire. Because there are still instruction(s) in execution it
reduces the size of the pipeline bubble.
>
>>>If anything, it is the reliance upon caching and instruction re-ordering
>>>which makes code cache misses such a performance killer, as they mitigated
>>>the problems with slow RAM to such an extent that there was little
>>>incentive to increase RAM speed.
>>
>> That is non-factual and barely coherent.
>
>Which part of it is unclear?
Code cache misses are not the killer you pretend, because the
pipeline and TLB start the memory access well in advance of the actual
need for the code to be present. Moreover it causes a burst read
which has a better transfer rate and hugely reduces latency for the
next few instructions.
>
>>>Although such techniques worked well for "classical" procedural code, they
>>>often worked rather less well for e.g. object-oriented code making heavy
>>>use of virtual functions, or interpreted languages where a substantial
>>>portion of the interpreter can be required for even the simplest functions.
>>
>> In terms of good instruction cache locality, interpreted languages win
>> hugely.
>
>That depends heavily upon the language complexity. For a simple
>language like BASIC, with few types and few primitive operations, you'll
>probably get good locality.
>
>For a complex language with many variants on the basic types, something as
>simple as adding a list of numbers can end up calling a dozen different
>addition functions (int+float, int+double, arbitrary-precision-int+double,
>...), and it may have to go through several steps just to determine the
>correct function for each value (e.g. Python will first check whether the
>LHS has an __add__ method, if not then whether the RHS has an __radd__
>(reverse add) method, then if either side has type-cast methods, ...).
Now you are grasping at hypothetical straws.
You are not making your case here. Go more for deep down details to
support your case. Without the deeper facts you are not all that
credible. Try to emulate Larry Wahl.
Perhaps. I'll look at this again later.
<snip>
> >> The only place you get killed is on cache writeback block outs, that
> >> does have 100 ns plus lags before reading the new data (but that does
> >> not apply to instruction caches).
> >
> >Huh? I'm missing your point here. Cache castouts aren't in the
> >performance path.
> >
>
> The issue is dirty cache page write back (data segments) in order to
> load a new page.
Page? The LINE isn't cast out until the read of the new line is
complete (and the memory bus idle). The castout isn't in the
critical performance path. The read is.
>>>>Especially on x86, where instructions are variable length and so can't
>>>>even be decoded if prior instructions are missing, let alone executed.
>>>
>>> Kindly explain how you get past the previous instruction to begin
>>> decoding the current instruction without decoding the previous
>>> instruction.
>>
>>Well, that's my point.
>>
>>If the CPU tries to execute the next instruction, but the data isn't
>>immediately available, it can just move on to the following instructions.
>
> For modern devices and even several older ones (pentium, SPARC,
> PA-RISC, i860, i960, S/360 and S/370, DEC VAX, DEC Alpha, and more)
> you do not have a point. Out of order execution requires decoding of
> each instruction to determine data dependencies. If the next
> instruction(s or more) does not have data dependencies, continuing
> with execution does no harm. It will not work for a 6502 or an 8080.
That's the second time in a row that you've claimed disagreement then
made the same point I was making.
No-one is disputing that these techniques mitigate data cache misses; the
point under discussion is:
>>OTOH, if it tries to execute the next instruction, but the *instruction*
>>isn't available, it's stuck. It can't do anything until the instruction is
>>available.
>>If the CPU can normally execute several instructions per clock (regardless
>>of whether their operands are immediately available), a code cache miss
>>means that it has to wait for the existing transfer to complete, then
>>start a new transfer and wait for that to start producing instructions,
>>meaning that it's going to end up many instructions behind where it would
>>have been compared to a cache hit.
>
> Do you think for some reason that there is only one block of instructions
> mapped in the cache? After the first time through a loop both
> sequences get mapped if they are not already.
Are you saying that code cache misses don't happen? Even once data is read
into the cache, it doesn't stay there forever.
>>Taking both branches is inefficient if one is far more likely than the
>>other; that's the purpose of branch prediction. E.g. with a loop, the
>>branch which executes the next iteration is usually far more common than
>>the branch which exits the loop, so you're better off just speculatively
>>executing the next iteration than hedging your bets.
>
> Only partially true. More aggressive cache preloading will have both
> sequences available for when the loop gets near termination.
Easier said than done. Caching instructions which are
executed after the loop terminates should be a lower priority than caching
anything which is executed during the loop (including any subroutines called
from the loop). Also, if you pre-load too early, the data risks being
discarded before it is ever used.
> Correct as far as it goes. But what if the next instruction is div
> r4,r5,r6; it can be executed because it has no unsatisfied data
> dependencies. See how it works?
Yes, I know how out-of-order and speculative execution work. The point,
which you keep overlooking, is that it relies upon knowing the
instructions. A code cache miss means that there aren't any instructions
to speculatively execute, i.e. the CPU is at a complete halt until it gets
those instructions.
When you have a CPU which can do 100 clocks in the time it takes to
initiate a transfer, and can do 3 instructions per clock, that's a
significant delay.
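The arithmetic can be written out as a short sketch. The function names are invented for the example; the 100 clocks and 3 instructions per clock are the figures above, while the 20 ns latency and 3 GHz clock are the figures quoted elsewhere in this thread, used here purely as sample inputs.

```python
def latency_in_clocks(latency_ns, clock_mhz):
    # One clock lasts 1000 / clock_mhz nanoseconds.
    return latency_ns * clock_mhz / 1000


def lost_issue_slots(stall_clocks, instructions_per_clock):
    # Upper bound on instructions that could have been issued during the stall.
    return stall_clocks * instructions_per_clock


print(lost_issue_slots(100, 3))       # 300
print(latency_in_clocks(20, 3000))    # 60.0
print(latency_in_clocks(100, 1000))   # 100.0
```

Which latency figure is realistic is exactly what is being argued; the sketch only shows how a latency and a clock rate turn into lost issue slots.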
>>And if you have:
>>
>> <missing instruction>
>> add r0,r1,r2
>>
>>you can't even commence the add with an abstract value for r2, because you
>>have no idea where the value will eventually come from. r2 may even hold a
>>concrete value, but you don't know that until you've seen the missing
>>instruction.
>
> This case cannot occur except at boot time when there are exactly 0
> previously executed instructions.
Did you actually interpret the above as meaning that there were no
instructions prior to <missing instruction>, or are you acting dumb for
the sake of argument?
>>Out-of-order and speculative execution avoid CPU stalls due to data
>>cache misses, but they either don't work or are significantly less
>>efficient for a code cache miss.
>
> Oh contraire. Because there are still instruction(s) in execution it
> reduces the size of the pipeline bubble.
The number of "active" (commenced but not completed) instructions is
likely to be in single digits. That isn't going to make much of a dent in
a 300-cycle stall (unless they're all double-precision divisions).
>>>>If anything, it is the reliance upon caching and instruction re-ordering
>>>>which makes code cache misses such a performance killer, as they mitigated
>>>>the problems with slow RAM to such an extent that there was little
>>>>incentive to increase RAM speed.
>>>
>>> That is non-factual and barely coherent.
>>
>>Which part of it is unclear?
>
> Code cache misses are not the killer you pretend, because the
> pipeline and TLB start the memory access well in advance of the actual
> need for the code to be present. Moreover it causes a burst read
> which has a better transfer rate and hugely reduces latency for the
> next few instructions.
You're assuming either very few branches (or indirect jumps), or very
accurate prediction. That may be true if you're writing Fortran to
evaluate an algebraic formula, but it's not true in general.
>>For a complex language with many variants on the basic types, something as
>>simple as adding a list of numbers can end up calling a dozen different
>>addition functions (int+float, int+double, arbitrary-precision-int+double,
>>...), and it may have to go through several steps just to determine the
>>correct function for each value (e.g. Python will first check whether the
>>LHS has an __add__ method, if not then whether the RHS has an __radd__
>>(reverse add) method, then if either side has type-cast methods, ...).
>
> Now you are grasping at hypothetical straws.
No, I'm explaining what modern interpreted languages are really like. If
you're going to write in BASIC, you may as well compile it. The main
reason for using interpreted languages is the flexibility provided by
dynamic dispatch.
> You are not making your case here. Go more for deep down details to
> support your case. Without the deeper facts you are not all that
> credible.
You may not understand the case being made, and making exactly the same
points in an attempt to refute it suggests that is the case.
Oops, forgot to add the link:
www.samsung.com/global/business/semiconductor/products/dram/downloads/ddr2_device_operation_timing_diagram_may_07.pdf
The very link dates it to May 2007.
Moreover check this, even though it is Wikipedia:
http://en.wikipedia.org/wiki/DDR2_SDRAM
Take a look at those cycle times and peak transfer rates. Sustained
rates are maybe 20% of peak but still quite a bit of data moving.
Improbable. If the cache line/page selected to be replaced is "dirty"
it must be written back before reading the new data or the changes
will be lost. Thus the dirty line/page overhead is present in some
fraction (usually small) of cache read attempts with misses. Balancing
allocated write bandwidth with read performance and cache miss rate is
the design issue.
See below for the question of the cost of a code cache miss.
You keep insisting that memory read is far slower than it actually is.
Using recent single core, superscalar, Pentium processors (the most
popular and easiest to get data for), look up the number of execution
units, the maximum number of simultaneously executing instructions,
the number of clock cycles needed to execute the instructions (by
profile, it does vary) etc. for a processor with say a 2.4 GHz clock,
connected to say 1 GB normal PC2-6400 (DDR2-800) and other normal bandwidth
loads on the busses. For reasonable access and cycle times (misnamed
on the page) use this Wikipedia page:
http://en.wikipedia.org/wiki/DDR2_SDRAM
How many clocks are needed to get a code space read (for a cache miss)
from memory for this configuration?
How big of a bubble does that put in the pipeline?
>
>>>And if you have:
>>>
>>> <missing instruction>
>>> add r0,r1,r2
>>>
>>>you can't even commence the add with an abstract value for r2, because you
>>>have no idea where the value will eventually come from. r2 may even hold a
>>>concrete value, but you don't know that until you've seen the missing
>>>instruction.
>>
>> This case cannot occur except at boot time when there are exactly 0
>> previously executed instructions.
>
>Did you actually interpret the above as meaning that there were no
>instructions prior to <missing instruction>, or are you acting dumb for
>the sake of argument?
No, you were acting dumb about the missing instruction. I explained
the only case it could occur and you disunderstood. Even with
branches instruction fetch is fundamentally sequential, you cannot
fetch instruction x+1 until you have fetched instruction x.
>
>>>Out-of-order and speculative execution avoid CPU stalls due to data
>>>cache misses, but they either don't work or are significantly less
>>>efficient for a code cache miss.
>>
>> Oh contraire. Because there are still instruction(s) in execution it
>> reduces the size of the pipeline bubble.
>
>The number of "active" (commenced but not completed) instructions is
>likely to be in single digits. That isn't going to make much of a dent in
>a 300-cycle stall (unless they're all double-precision divisions).
>
>>>>>If anything, it is the reliance upon caching and instruction re-ordering
>>>>>which makes code cache misses such a performance killer, as they mitigated
>>>>>the problems with slow RAM to such an extent that there was little
>>>>>incentive to increase RAM speed.
>>>>
>>>> That is non-factual and barely coherent.
>>>
>>>Which part of it is unclear?
>>
>> Code cache misses are not the killer you pretend, because the
>> pipeline and TLB start the memory access well in advance of the actual
>> need for the code to be present. Moreover it causes a burst read
>> which has a better transfer rate and hugely reduces latency for the
>> next few instructions.
>
>You're assuming either very few branches (or indirect jumps), or very
>accurate prediction. That may be true if you're writing Fortran to
>evaluate an algebraic formula, but it's not true in general.
From the profiler; about 3% to 10% branches, with about 75% successful
prediction. See above for discussion of clocks for memory access
time. I am NOT discussing using swap disk with its millisecond class
overhead.
>
>>>For a complex language with many variants on the basic types, something as
>>>simple as adding a list of numbers can end up calling a dozen different
>>>addition functions (int+float, int+double, arbitrary-precision-int+double,
>>>...), and it may have to go through several steps just to determine the
>>>correct function for each value (e.g. Python will first check whether the
>>>LHS has an __add__ method, if not then whether the RHS has an __radd__
>>>(reverse add) method, then if either side has type-cast methods, ...).
>>
>> Now you are grasping at hypothetical straws.
>
>No, I'm explaining what modern interpreted languages are really like. If
>you're going to write in BASIC, you may as well compile it. The main
>reason for using interpreted languages is the flexibility provided by
>dynamic dispatch.
Have you ever pounded your way through the lowest level code of an
interpreter? They emulate a nonexistent virtual machine. The issues
related to dynamic dispatch in compiled code do not obtain in
interpreted code. The process is that different.
It doesn't even have a psychiatrist. Emacs has a built-in psychiatrist.
--
Paul Hovnanian mailto:Pa...@Hovnanian.com
------------------------------------------------------------------
Time's fun when you're having flies. -- Kermit the Frog
Nope. It's written to the store queue where it can be written back to
memory at the processor's leisure.
> Thus the dirty line/page overhead is present in some
>fraction (usually small) of cache read attempts with misses. Balancing
>allocated write bandwidth with read performance and cache miss rate is
>the design issue.
Write bandwidth has nothing to do with it, except in the pathological
case where you miss on every fetch.
>>When you have a CPU which can do 100 clocks in the time it takes to
>>initiate a transfer, and can do 3 instructions per clock, that's a
>>significant delay.
>
> You keep insisting that memory read is far slower than it actually is.
Let's back up a couple of days to the point at which I entered this thread:
> Date: Sat, 24 Jan 2009 19:39:32 +0000
> Message-Id: <pan.2009.01.24....@nowhere.com>
> It depends how fast your RAM is. At one point (I guess around 5 years
> ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
> speed has been consistently increasing faster than CPU speed for the
> last few years.
So, the 350 cycles figure would have been from the era of PC-133 or the
first generation of DDR. That would have been the last time that I wrote
code which involved timing diagrams.
>>>>And if you have:
>>>>
>>>> <missing instruction>
>>>> add r0,r1,r2
>>>>
>>>>you can't even commence the add with an abstract value for r2, because you
>>>>have no idea where the value will eventually come from. r2 may even hold a
>>>>concrete value, but you don't know that until you've seen the missing
>>>>instruction.
>>>
>>> This case cannot occur except at boot time when there are exactly 0
>>> previously executed instructions.
>>
>>Did you actually interpret the above as meaning that there were no
>>instructions prior to <missing instruction>, or are you acting dumb for
>>the sake of argument?
>
> No, you were acting dumb about the missing instruction. I explained
> the only case it could occur and you disunderstood. Even with
> branches instruction fetch is fundamentally sequential, you cannot
> fetch instruction x+1 until you have fetched instruction x.
You can displace instruction x from the cache without displacing
instruction x+1.
>>You're assuming either very few branches (or indirect jumps), or very
>>accurate prediction. That may be true if you're writing Fortran to
>>evaluate an algebraic formula, but it's not true in general.
>
> From the profiler; about 3% to 10% branches, with about 75% successful
> prediction.
75% isn't "very accurate"; and for getting two consecutive branches
correct, that's only 56%. For code with a lot of conditionals, you can't
assume that prefetching is going to ensure that the right instructions are
cached.
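The 56% figure is just the accuracy multiplied by itself. A sketch, assuming each prediction is independent of the others (a simplification; real predictors are correlated with branch history), with a made-up function name:

```python
def chance_all_correct(accuracy, branches):
    # Probability that every one of `branches` predictions is right,
    # assuming each prediction succeeds independently.
    return accuracy ** branches


print(chance_all_correct(0.75, 1))   # 0.75
print(chance_all_correct(0.75, 2))   # 0.5625
print(chance_all_correct(0.75, 4))   # 0.31640625
```

So a prefetcher that has to guess its way past four branches at 75% each is on the right path less than a third of the time.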
>>No, I'm explaining what modern interpreted languages are really like. If
>>you're going to write in BASIC, you may as well compile it. The main
>>reason for using interpreted languages is the flexibility provided by
>>dynamic dispatch.
>
> Have you ever pounded your way through the lowest level code of an
> interpreter? They emulate a nonexistent virtual machine. The issues
> related to dynamic dispatch in compiled code do not obtain in
> interpreted code. The process is that different.
At the lowest level, it is examining each instruction and invoking the
corresponding primitive. But this isn't like emulating a real CPU where
you have maybe a couple of dozen common instructions; the Python core
is around 2500 primitives in 750KB of code. When the VM implementation is
hopping all over that much code, you aren't going to have it all in the
cache.
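For reference, the "examine each instruction and invoke the corresponding primitive" loop looks roughly like this. It is a toy stack machine with three invented opcodes, not how CPython's core is actually written.

```python
def run(program):
    """Execute a list of (opcode, argument) pairs on a toy stack machine."""
    stack = []

    def push(arg):
        stack.append(arg)

    def add(_arg):
        right = stack.pop()
        left = stack.pop()
        stack.append(left + right)

    def mul(_arg):
        right = stack.pop()
        left = stack.pop()
        stack.append(left * right)

    # Dispatch table: one entry per primitive. A real interpreter has
    # vastly more entries, and each one is a separate piece of code.
    primitives = {"push": push, "add": add, "mul": mul}

    for opcode, arg in program:
        primitives[opcode](arg)   # look up the primitive, then jump to it
    return stack.pop()


# (2 + 3) * 4
program = [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]
print(run(program))   # 20
```

With three primitives everything stays in cache; the point above is about what happens when the table has thousands of entries spread over hundreds of kilobytes.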
Not so much a pathological case, but a low frequency case where all
cache lines/pages are dirty at the time of a cache miss.
>On Tue, 27 Jan 2009 19:31:34 -0800, JosephKK wrote:
>
>>>When you have a CPU which can do 100 clocks in the time it takes to
>>>initiate a transfer, and can do 3 instructions per clock, that's a
>>>significant delay.
>>
>> You keep insisting that memory read is far slower than it actually is.
>
>Let's back up a couple of days to the point at which I entered this thread:
>
>> Date: Sat, 24 Jan 2009 19:39:32 +0000
>> Message-Id: <pan.2009.01.24....@nowhere.com>
>
>> It depends how fast your RAM is. At one point (I guess around 5 years
>> ago), 350 CPU cycles for a code cache miss was not atypical, but RAM
>> speed has been consistently increasing faster than CPU speed for the
>> last few years.
>
>So, the 350 cycles figure would have been from the era of PC-133 or the
>first generation of DDR. That would have been the last time that I wrote
>code which involved timing diagrams.
This will set up some time line referents to work with:
http://www.dewassoc.com/performance/memory/how_to_id_pc133.htm
Taking 1999 as a useful base year, let's look at processors:
http://www.pdfdownload.org/pdf2html/pdf2html.php?url=http%3A%2F%2Fwww.connellybarnes.com%2Fdocuments%2Fcpu_speed.pdf&images=yes
So in 1999 Intel (IA-32) CPU speed was about 500 MHz and memory speed
was about 133 MHz. That is not 300:1, it is not even 10:1 and it took
multiple clocks to execute most instructions (early Pentium, not even
P2).
>
>>>>>And if you have:
>>>>>
>>>>> <missing instruction>
>>>>> add r0,r1,r2
>>>>>
>>>>>you can't even commence the add with an abstract value for r2, because you
>>>>>have no idea where the value will eventually come from. r2 may even hold a
>>>>>concrete value, but you don't know that until you've seen the missing
>>>>>instruction.
>>>>
>>>> This case cannot occur except at boot time when there are exactly 0
>>>> previously executed instructions.
>>>
>>>Did you actually interpret the above as meaning that there were no
>>>instructions prior to <missing instruction>, or are you acting dumb for
>>>the sake of argument?
>>
>> No, you were acting dumb about the missing instruction. I explained
>> the only case it could occur and you disunderstood. Even with
>> branches instruction fetch is fundamentally sequential, you cannot
>> fetch instruction x+1 until you have fetched instruction x.
>
>You can displace instruction x from the cache without displacing
>instruction x+1.
Possible in some cache schemes, not in most.
>
>>>You're assuming either very few branches (or indirect jumps), or very
>>>accurate prediction. That may be true if you're writing Fortran to
>>>evaluate an algebraic formula, but it's not true in general.
>>
>> From the profiler; about 3% to 10% branches, with about 75% successful
>> prediction.
>
>75% isn't "very accurate"; and for getting two consecutive branches
>correct, that's only 56%. For code with a lot of conditionals, you can't
>assume that prefetching is going to ensure that the right instructions are
>cached.
Changes very little. Back to back branches are uncommon to rare.
>
>>>No, I'm explaining what modern interpreted languages are really like. If
>>>you're going to write in BASIC, you may as well compile it. The main
>>>reason for using interpreted languages is the flexibility provided by
>>>dynamic dispatch.
>>
>> Have you ever pounded your way through the lowest level code of an
>> interpreter? They emulate a nonexistent virtual machine. The issues
>> related to dynamic dispatch in compiled code do not obtain in
>> interpreted code. The process is that different.
>
>At the lowest level, it is examining each instruction and invoking the
>corresponding primitive. But this isn't like emulating a real CPU where
>you have maybe a couple of dozen common instructions; the Python core
>is around 2500 primitives in 750KB of code. When the VM implementation is
>hopping all over that much code, you aren't going to have it all in the
>cache.
Code locality is a function of most used primitives as well, thus the
less common primitives cause most of the cache misses and the
commonest primitives are almost always in cache.
Same thing with data locality. I trust you learned about that as
well.
No. You're incorrectly assuming that a cache line has to be written
from the "CPU" into main memory immediately when it has to
be cast out. This is *NOT* the case. The write pipe isn't direct.
There is a store queue that goes around the read pipe. The store
happens after the read in all cases except where the store queue is
already full. That requires the pathological case where there aren't
any free memory slots in which to do the writes.
> Taking 1999 as a useful base year, let's look at processors:
> So in 1999 Intel (IA-32) CPU speed was about 500 MHz and memory speed
> was about 133 MHz. That is not 300:1, it is not even 10:1 and it took
> multiple clocks to execute most instructions (early Pentium, not even
> P2).
Pentium 3.
Pentium Pro only went up to 200MHz, Pentium 2 up to 450MHz. Pentium 3 came
out in 1999, at 450/500MHz.
At the other end of the scale, my P3/800 used PC-133, and there were P3s
up to 1100MHz with a 100MHz FSB and 1400MHz with a 133MHz FSB. That's the
kind of system where 300 clocks is feasible for a code cache miss.
After that, DDR appeared, and memory finally started to catch up with the
CPU. But prior to that, you had
>>> No, you were acting dumb about the missing instruction. I explained
>>> the only case it could occur and you disunderstood. Even with
>>> branches instruction fetch is fundamentally sequential, you cannot
>>> fetch instruction x+1 until you have fetched instruction x.
>>
>>You can displace instruction x from the cache without displacing
>>instruction x+1.
>
> Possible in some cache schemes, not in most.
Possible in any cache scheme. If instruction x+1 is a branch target, it
can be both more recently used and more used recently than instruction x.
>>>>You're assuming either very few branches (or indirect jumps), or very
>>>>accurate prediction. That may be true if you're writing Fortran to
>>>>evaluate an algebraic formula, but it's not true in general.
>>>
>>> From the profiler; about 3% to 10% branches, with about 75% successful
>>> prediction.
>>
>>75% isn't "very accurate"; and for getting two consecutive branches
>>correct, that's only 56%. For code with a lot of conditionals, you can't
>>assume that prefetching is going to ensure that the right instructions are
>>cached.
>
> Changes very little. Back to back branches are uncommon to rare.
That depends upon the type of code you're writing. Obviously, branches
which are "exactly" back-to-back are rare, but test,branch,test,branch
isn't that uncommon; an extreme case is code which embodies a domain of
knowledge, classifying its input then applying the corresponding rules
(IOW, something akin to a Lisp "cond" statement, except that you would
normally try to use a hierarchical decision tree rather than performing
the tests sequentially).
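
The 56% figure quoted above can be checked with a minimal sketch. It assumes each branch is predicted correctly with the same probability and independently of the others, which real predictors do not guarantee; the function name is purely illustrative.

```python
def all_predicted(p: float, n: int) -> float:
    """Chance that n consecutive branches are all predicted correctly.

    Assumption (not from the thread): predictions are independent and
    each succeeds with the same probability p.
    """
    return p ** n


# 75% per-branch accuracy, as reported from the profiler in the thread.
for n in (1, 2, 3, 4):
    print(f"{n} branch(es) in a row: {all_predicted(0.75, n):.4f}")
```

Under that assumption a test,branch,test,branch chain loses accuracy geometrically, which is why a decision tree with fewer tests per path helps.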
>>>>No, I'm explaining what modern interpreted languages are really like. If
>>>>you're going to write in BASIC, you may as well compile it. The main
>>>>reason for using interpreted languages is the flexibility provided by
>>>>dynamic dispatch.
>>>
>>> Have you ever pounded your way through the lowest level code of an
>>> interpreter? They emulate a nonexistent virtual machine. The issues
>>> related to dynamic dispatch in compiled code do not obtain in
>>> interpreted code. The process is that different.
>>
>>At the lowest level, it is examining each instruction and invoking the
>>corresponding primitive. But this isn't like emulating a real CPU where
>>you have maybe a couple of dozen common instructions; the Python core
>>is around 2500 primitives in 750KB of code. When the VM implementation is
>>hopping all over that much code, you aren't going to have it all in the
>>cache.
>
> Code locality is a function of the most used primitives as well; thus the
> less common primitives cause most of the cache misses, and the
> commonest primitives are almost always in cache.
That's fine if you have a handful of common primitives and the rest are
rare, but glancing over Python's primitives, I'd say that fully half of
them are common. The kind of code which would only use a handful of
primitives is the kind of code which you would write in C.
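
To make "examining each instruction and invoking the corresponding primitive" concrete, here is a toy sketch of a dispatch loop. The three-opcode stack machine and all names in it are hypothetical and are not taken from Python's actual core, which has far more primitives.

```python
# Hypothetical primitives for a toy stack machine (illustration only).
def op_push(stack, arg):
    stack.append(arg)


def op_add(stack, arg):
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)


def op_mul(stack, arg):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)


# Dispatch table: opcode name -> primitive implementing it.
PRIMITIVES = {"PUSH": op_push, "ADD": op_add, "MUL": op_mul}


def run(program):
    """Execute (opcode, argument) pairs one at a time and return the stack."""
    stack = []
    for opcode, arg in program:
        # Every instruction costs one table lookup plus one indirect call.
        PRIMITIVES[opcode](stack, arg)
    return stack


# (2 + 3) * 4
print(run([("PUSH", 2), ("PUSH", 3), ("ADD", None),
           ("PUSH", 4), ("MUL", None)]))
```

With three primitives everything stays in cache; the disagreement in the thread is about what happens when the table has thousands of entries and many of them are in regular use.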
>On Wed, 28 Jan 2009 18:48:09 -0800, JosephKK wrote:
>
>> Taking 1999 as a useful base year, let's look at processors:
>
>> So in 1999 Intel (IA-32) CPU speed was about 500 MHz and memory speed
>> was about 133 MHz. That is not 300:1, it is not even 10:1 and it took
>> multiple clocks to execute most instructions (early Pentium, not even
>> P2).
>
>Pentium 3.
>
>Pentium Pro only went up to 200MHz, Pentium 2 up to 450MHz. Pentium 3 came
>out in 1999, at 450/500MHz.
>
>At the other end of the scale, my P3/800 used PC-133, and there were P3s
>up to 1100MHz with a 100MHz FSB and 1400MHz with a 133MHz FSB. That's the
>kind of system where 300 clocks is feasible for a code cache miss.
By the way, why did you snip away my references?
This will set up some timeline reference points to work with:
http://www.dewassoc.com/performance/memory/how_to_id_pc133.htm
Taking 1999 as a useful base year, let's look at processors:
http://www.pdfdownload.org/pdf2html/pdf2html.php?url=http%3A%2F%2Fwww.connellybarnes.com%2Fdocuments%2Fcpu_speed.pdf&images=yes
Just the same, even with your unsupported values:
Let's see, even about 10 to 1 clock speed difference cannot translate
into over 100 to 1 time cost.
> By the way, why did you snip away my references?
Because they weren't necessary to comprehend my reply. I'm used to fora
where quoting entire messages is frowned upon.
> Let's see, even about 10 to 1 clock speed difference cannot translate
> into over 100 to 1 time cost.
Clock speed alone tells you nothing. How many clocks is the worst-case
latency, assuming an existing burst is in progress on a different row?
>On Sat, 31 Jan 2009 19:30:18 -0800, JosephKK wrote:
>
>> By the way, why did you snip away my references?
>
>Because they weren't necessary to comprehend my reply. I'm used to fora
>where quoting entire messages is frowned upon.
I snip from time to time.
You are taking an interesting position, because those references
completely undercut your premises.
>
>> Let's see, even about 10 to 1 clock speed difference cannot translate
>> into over 100 to 1 time cost.
>
>Clock speed alone tells you nothing. How many clocks is the worst-case
>latency, assuming an existing burst is in progress on a different row?
>
If you want to include clocks to complete things, the typically higher
CPU clocks per instruction (about 7 to 12 for 90% of the instruction
stream on x86, ignoring pipelining) compared to clocks per memory
access (typically 3 to 5 without burst, 5 to 11 with a burst in
progress) still come out against you.
>>> Let's see, even about 10 to 1 clock speed difference cannot translate
>>> into over 100 to 1 time cost.
>>
>>Clock speed alone tells you nothing. How many clocks is the worst-case
>>latency, assuming an existing burst is in progress on a different row?
>>
>
> If you want to include clocks to complete things the typically higher
> CPU clocks per instruction (about 7 to 12 for 90% of instruction
> stream on X86, ignoring pipelining) compared to clocks per memory
> access (typically 3 to 5 without burst, 5 to 11 with burst in
> progress) still comes out against you.
If so, worst case would be 11 memory clocks, with 10 CPU clocks per memory
clock, 3 instructions (or 3 cycles' worth of instructions) per CPU cycle =
330 cycles.
Or is there some reason why that cannot happen? Remember, we're talking
worst case, not average case (average case is a cache hit). And worst-case
isn't always some obscure theoretical concept. It's not hard to write code
which is memory-bound (so there will usually be a burst in progress) and
has poor cache coherence (so cache misses are common), and an instruction
fetch will typically be for a different row than a data fetch.
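
The worst-case arithmetic above, spelled out as a small sketch. The three inputs are the figures claimed in this thread, not measured values, and the variable names are only illustrative. Strictly, the final product counts lost instruction issue slots rather than clock cycles.

```python
# Figures as claimed in the thread (assumptions, not measurements).
memory_clocks = 11        # upper end of "5 to 11 with burst in progress"
cpu_per_memory_clock = 10 # e.g. a 1000 MHz core on a 100 MHz FSB
issue_width = 3           # peak instructions started per CPU clock

# CPU clocks spent waiting for the worst-case memory access.
stall_cpu_clocks = memory_clocks * cpu_per_memory_clock

# Instruction slots that go unused during that wait.
lost_instruction_slots = stall_cpu_clocks * issue_width

print(stall_cpu_clocks)        # 110
print(lost_instruction_slots)  # 330
```

Changing any one input (a lower clock multiplier, or an issue width of 1) scales the result linearly, which is where the two sides of the argument part ways.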
>On Sun, 01 Feb 2009 11:42:46 -0800, JosephKK wrote:
>
>>>> Let's see, even about 10 to 1 clock speed difference cannot translate
>>>> into over 100 to 1 time cost.
>>>
>>>Clock speed alone tells you nothing. How many clocks is the worst-case
>>>latency, assuming an existing burst is in progress on a different row?
>>>
>>
>> If you want to include clocks to complete things the typically higher
>> CPU clocks per instruction (about 7 to 12 for 90% of instruction
>> stream on X86, ignoring pipelining) compared to clocks per memory
>> access (typically 3 to 5 without burst, 5 to 11 with burst in
>> progress) still comes out against you.
>
>If so, worst case would be 11 memory clocks, with 10 CPU clocks per memory
>clock, 3 instructions (or 3 cycles' worth of instructions) per CPU cycle =
>330 cycles.
Where in Finnegans fictional fantasies did you get this weird
arithmetic? Where did the 3 instructions come from? Where do you get
10 CPU clocks per memory clock? Neither one is supported by the
facts. Look again at the references:
This will set up some timeline reference points to work with:
http://www.dewassoc.com/performance/memory/how_to_id_pc133.htm
Taking 1999 as a useful base year, let's look at processors:
http://www.pdfdownload.org/pdf2html/pdf2html.php?url=http%3A%2F%2Fwww.connellybarnes.com%2Fdocuments%2Fcpu_speed.pdf&images=yes
The clock ratios you claim just are not there.
>
>Or is there some reason why that cannot happen? Remember, we're talking
>worst case, not average case (average case is a cache hit). And worst-case
>isn't always some obscure theoretical concept. It's not hard to write code
>which is memory-bound (so there will usually be a burst in progress) and
>has poor cache coherence (so cache misses are common), and an instruction
>fetch will typically be for a different row than a data fetch.
While it is possible to write pathological code in assembler, higher
level languages will generally prevent it. It may be possible to
brute force "C" in this way, but it will be readily recognizable as
pathological.
>>>>> Let's see, even about 10 to 1 clock speed difference cannot translate
>>>>> into over 100 to 1 time cost.
>>>>
>>>>Clock speed alone tells you nothing. How many clocks is the worst-case
>>>>latency, assuming an existing burst is in progress on a different row?
>>>>
>>>>
>>> If you want to include clocks to complete things the typically higher
>>> CPU clocks per instruction (about 7 to 12 for 90% of instruction stream
>>> on X86, ignoring pipelining) compared to clocks per memory access
>>> (typically 3 to 5 without burst, 5 to 11 with burst in progress) still
>>> comes out against you.
>>
>>If so, worst case would be 11 memory clocks, with 10 CPU clocks per
>>memory clock, 3 instructions (or 3 cycles' worth of instructions) per CPU
>>cycle = 330 cycles.
>
> Where in Finnegans fictional fantasies did you get this weird arithmetic?
> Where did the 3 instructions come from?
It's called "superscalar"; I thought that you understood this concept.
PentiumPro and later can execute multiple instructions concurrently,
commencing and completing up to 3 instructions per clock cycle.
> Where do you get 10 CPU clocks per memory clock?
1100MHz CPU with 100MHz FSB, 1400MHz CPU with 133MHz FSB. Odd that you
didn't take issue with it a few messages back.
> Neither one is supported by the facts. Look again at
> the references:
>
> This will set up some time line referents to work with:
> http://www.dewassoc.com/performance/memory/how_to_id_pc133.htm
>
> Taking 1999 as a useful base year, let's look at processors:
> http://www.pdfdownload.org/pdf2html/pdf2html.php?url=http%3A%2F%2Fwww.connellybarnes.com%2Fdocuments%2Fcpu_speed.pdf&images=yes
>
> The clock ratios you claim just are not there.
Nor is anything else there, AFAICT. The first one doesn't appear to
mention CPU speeds. The second one just has a banner and "no file", but
using the original (non-pdfdownload.org) URL gives a PDF which charts CPU
speed against year, with no mention of FSB speeds.
Here are some actual references:
http://processorfinder.intel.com/details.aspx?sSpec=SL5XL
CPU Speed: 1.40 GHz
Bus Speed: 133 MHz
Bus/Core Ratio: 10.5
http://processorfinder.intel.com/details.aspx?sSpec=SL4BR
CPU Speed: 1 GHz
Bus Speed: 100 MHz
Bus/Core Ratio: 10
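
As a quick sanity check on the two listings, the ratio is just core clock divided by bus clock. The sketch below assumes the nominal "133 MHz" bus actually runs at 133.33 MHz (400/3), which is what makes the listed 10.5 come out exactly; the function name is illustrative.

```python
def core_to_bus_ratio(core_mhz: float, bus_mhz: float) -> float:
    """CPU core clocks per front-side-bus clock."""
    return core_mhz / bus_mhz


# SL4BR: 1 GHz core, 100 MHz bus.
print(f"{core_to_bus_ratio(1000, 100):.2f}")

# SL5XL: 1.40 GHz core; bus taken as 400/3 MHz (assumption, see above).
print(f"{core_to_bus_ratio(1400, 400 / 3):.2f}")
```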
>>Or is there some reason why that cannot happen? Remember, we're talking
>>worst case, not average case (average case is a cache hit). And
>>worst-case isn't always some obscure theoretical concept. It's not hard
>>to write code which is memory-bound (so there will usually be a burst in
>>progress) and has poor cache coherence (so cache misses are common), and
>>an instruction fetch will typically be for a different row than a data
>>fetch.
>
> While it is possible to write pathological code in assembler, higher level
> languages will generally prevent it. It may be possible to brute force
> "C" in this way, but it will be readily recognizable as pathological.
That's not even remotely true. Any code which performs simple calculations
on large amounts of data is inherently memory bound (i.e. there is
always an outstanding transfer).
The most obvious case of code with poor cache coherence is OO code where
an abstract base class has many subclasses.
For a concrete example, a 3D game engine will typically have abstract
"brush" and "actor" classes, the first representing immutable
terrain (walls, floors), the second representing dynamic entities
(enemies, weapons, ordnance, other mutable objects, ...).
Updating the game state involves iterating over a set of actors, but the
code executed for each one depends upon the final class (updating a zombie
is quite different from updating a bullet). You can realistically end up
calling over 100 distinct update methods for a single frame.
Rendering is similar, although there are fewer distinct methods (but the
number is continually increasing with the use of specialised shaders,
procedural textures etc) but more data (you have to render both terrain
and actors, but terrain doesn't need updating).
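
A minimal sketch of the update loop described above, with only two of the many subclasses. The class names, fields, and numbers are hypothetical and not taken from any particular engine.

```python
from abc import ABC, abstractmethod


class Actor(ABC):
    """Abstract dynamic entity; each subclass brings its own update code."""

    @abstractmethod
    def update(self, dt: float) -> None:
        ...


class Zombie(Actor):
    def __init__(self) -> None:
        self.x = 0.0

    def update(self, dt: float) -> None:
        self.x += 1.0 * dt  # slow shamble


class Bullet(Actor):
    def __init__(self) -> None:
        self.x = 0.0
        self.alive = True

    def update(self, dt: float) -> None:
        self.x += 300.0 * dt  # fast, straight-line motion
        if self.x > 100.0:    # past maximum range
            self.alive = False


def update_all(actors, dt: float) -> None:
    # The code reached by each call depends on the object's final class,
    # so consecutive iterations can jump to unrelated code.
    for actor in actors:
        actor.update(dt)


zombie, bullet = Zombie(), Bullet()
update_all([zombie, bullet], 0.5)
print(zombie.x, bullet.x, bullet.alive)
```

With two subclasses both methods stay cached; the point being argued is what happens when the list mixes a hundred or more distinct final classes per frame.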
>On Sun, 01 Feb 2009 22:20:08 -0800, JosephKK wrote:
>
>>>>>> Let's see, even about 10 to 1 clock speed difference cannot translate
>>>>>> into over 100 to 1 time cost.
>>>>>
>>>>>Clock speed alone tells you nothing. How many clocks is the worst-case
>>>>>latency, assuming an existing burst is in progress on a different row?
>>>>>
>>>>>
>>>> If you want to include clocks to complete things the typically higher
>>>> CPU clocks per instruction (about 7 to 12 for 90% of instruction stream
>>>> on X86, ignoring pipelining) compared to clocks per memory access
>>>> (typically 3 to 5 without burst, 5 to 11 with burst in progress) still
>>>> comes out against you.
>>>
>>>If so, worst case would be 11 memory clocks, with 10 CPU clocks per
>>>memory clock, 3 instructions (or 3 cycles' worth of instructions) per CPU
>>>cycle = 330 cycles.
>>
>> Where in Finnegans fictional fantasies did you get this weird arithmetic?
>> Where did the 3 instructions come from?
>
>It's called "superscalar"; I thought that you understood this concept.
>PentiumPro and later can execute multiple instructions concurrently,
>commencing and completing up to 3 instructions per clock cycle.
I understand it very well, PPro does not have the capability. It only
began to appear in P4s. Moreover, many of its supposed attributes are
seriously mimicked by pipelining, which PPro and P2 and above do have.
True superscalar requires multiple execution units, which only P4 and
later have (other architectures had it earlier). SPARC is the only
single chip architecture that got past 3 execution units that i know
of (IBM 3090 did 4). Intel figured that full cores were actually
easier to implement; it seems they are right.
>
>> Where do you get 10 CPU clocks per memory clock?
>
>1100MHz CPU with 100MHz FSB, 1400MHz CPU with 133MHz FSB. Odd that you
>didn't take issue with it a few messages back.
By the time i bothered to check, the issue had shifted. By the time
1100 MHz to 1400 MHz CPU cores had appeared, we had DDR 333 RAM.
Provide references for your data points as i have.
Sorry, i don't care much about inefficiently designed/implemented
games, nor am i a gamer. That type are hoist on their own petard.
Do you have any other type of benchmark data?
>>>>If so, worst case would be 11 memory clocks, with 10 CPU clocks per
>>>>memory clock, 3 instructions (or 3 cycles' worth of instructions) per CPU
>>>>cycle = 330 cycles.
>>>
>>> Where in Finnegans fictional fantasies did you get this weird arithmetic?
>>> Where did the 3 instructions come from?
>>
>>It's called "superscalar"; I thought that you understood this concept.
>>PentiumPro and later can execute multiple instructions concurrently,
>>commencing and completing up to 3 instructions per clock cycle.
>
> I understand it very well, PPro does not have the capability. It only
> began to appear in P4s. Moreover, many of its supposed attributes are
> seriously mimicked by pipelining, which PPro and P2 and above do have.
> True superscalar requires multiple execution units, which only P4 and
> later have (other architectures had it earlier). SPARC is the only
> single chip architecture that got past 3 execution units that i know
> of (IBM 3090 did 4). Intel figured that full cores were actually
> easier to implement; it seems they are right.
Okay, so we're arguing over terminology again. Intel considers pipelining
to be what the original Pentium had, calling PPro upwards superscalar.
Even without multiple ALUs, PPro upwards can execute multiple load/store
operations concurrently, alongside one integer and one FP operation.
>>> Where do you get 10 CPU clocks per memory clock?
>>
>>1100MHz CPU with 100MHz FSB, 1400MHz CPU with 133MHz FSB. Odd that you
>>didn't take issue with it a few messages back.
>
> By the time i bothered to check, the issue had shifted. By the time
> 1100 MHz to 1400 MHz CPU cores had appeared, we had DDR 333 RAM.
>
> Provide references for your data points as i have.
I did:
>>http://processorfinder.intel.com/details.aspx?sSpec=SL5XL
>>
>>CPU Speed: 1.40 GHz
>>Bus Speed: 133 MHz
>>Bus/Core Ratio: 10.5
>>
>>http://processorfinder.intel.com/details.aspx?sSpec=SL4BR
>>
>>CPU Speed: 1 GHz
>>Bus Speed: 100 MHz
>>Bus/Core Ratio: 10
DDR 333 may have existed at this point, but so did CPUs with 100/133 FSB.
>>> While it is possible to write pathological code in assembler, higher level
>>> languages will generally prevent it. It may be possible to brute force
>>> "C" in this way, but it will be readily recognizable as pathological.
>>
>>That's not even remotely true. Any code which performs simple calculations
>>on large amounts of data is inherently memory bound (i.e. there is
>>always an outstanding transfer).
>>
>>The most obvious case of code with poor cache coherence is OO code where
>>an abstract base class has many subclasses.
>>
>>For a concrete example, a 3D game engine will typically have abstract
>>"brush" and "actor" classes, the first representing immutable
>>terrain (walls, floors), the second representing dynamic entities
>>(enemies, weapons, ordnance, other mutable objects, ...).
>>
>>Updating the game state involves iterating over a set of actors, but the
>>code executed for each one depends upon the final class (updating a zombie
>>is quite different from updating a bullet). You can realistically end up
>>calling over 100 distinct update methods for a single frame.
>>
>>Rendering is similar, although there are fewer distinct methods (but the
>>number is continually increasing with the use of specialised shaders,
>>procedural textures etc) but more data (you have to render both terrain
>>and actors, but terrain doesn't need updating).
>
> Sorry, i don't care much about inefficiently designed/implemented
> games, nor am i a gamer. That type are hoist on their own petard.
Oh; so it's the programmers' fault for not writing megabytes of code in
hand-tuned assembler?
Real-world code doesn't look anything like the Fortran or Pascal examples
you may have learned in college, or the kind of code you would write for a
microcontroller.
> Do you have any other type of benchmark data?
So you can dismiss that too (along with anything else which contradicts
your claims) as a "pathological" case.
But there's no point in citing specific examples. Just download any
substantial software package for which source code is available
(especially anything written in C++).
If you're programming x86 (i.e. PCs/servers), software where 99% of the
CPU cycles are spent in a few KiB of code is the exception rather than the
rule.
>On Mon, 02 Feb 2009 20:48:30 -0800, JosephKK wrote:
>
>>>>>If so, worst case would be 11 memory clocks, with 10 CPU clocks per
>>>>>memory clock, 3 instructions (or 3 cycles' worth of instructions) per CPU
>>>>>cycle = 330 cycles.
>>>>
>>>> Where in Finnegans fictional fantasies did you get this weird arithmetic?
>>>> Where did the 3 instructions come from?
>>>
<snip>
>>>Rendering is similar, although there are fewer distinct methods (but the
>>>number is continually increasing with the use of specialised shaders,
>>>procedural textures etc) but more data (you have to render both terrain
>>>and actors, but terrain doesn't need updating).
>>
>> Sorry, i don't care much about inefficiently designed/implemented
>> games, nor am i a gamer. That type are hoist on their own petard.
>
>Oh; so it's the programmers' fault for not writing megabytes of code in
>hand-tuned assembler?
Off the point quite a bit.
>
>Real-world code doesn't look anything like the Fortran or Pascal examples
>you may have learned in college, or the kind of code you would write for a
>microcontroller.
Many in this ng write mucho real world assembler for microcontrollers
for real world products and create new prosperity by doing so. How
could it not be, and not "look like", real world code?
>
>> Do you have any other type of benchmark data?
>
>So you can dismiss that too (along with anything else which contradicts
>your claims) as a "pathological" case.
>
>But there's no point in citing specific examples. Just download any
>substantial software package for which source code is available
>(especially anything written in C++).
>
>If you're programming x86 (i.e. PCs/servers), software where 99% of the
>CPU cycles are spent in a few KiB of code is the exception rather than the
>rule.
Wild. While i know of servers that do not spend all their time in the
"idle loop", my workplace desktop and my home desktops spend about 99%
of their time waiting for something to do. I think i would like to
have some SPICE circuits large enough to take enough time to be worth
profiling while they are running.
>>Real-world code doesn't look anything like the Fortran or Pascal examples
>>you may have learned in college, or the kind of code you would write for a
>>microcontroller.
>
> Many in this ng write mucho real world assembler for microcontrollers
> for real world products and create new prosperity by doing so. How
> could it not be, and not "look like", real world code?
I should have said "real world PC code", as it's specifically the x86
which started this sub-thread.
The type of code you would write for a microcontroller wouldn't have
cache coherence issues if it was run on a PC; even a Celeron's cache is
larger than the entire combined RAM + flash of many microcontrollers.
And microcontrollers typically don't run at speeds where everything has to
be engineered around memory latency.
But that's not the kind of software Intel/AMD CPUs are designed for.
And game code is hardly "inefficient" by PC standards; it's one of the few
areas where performance is actually considered important (contrast with
e.g. Windows itself, or MS-Office, or similar "bloatware").
Fine. I will go my own way now.