
Question on scalability of multi-core Processors


Stephen Fuld

Dec 28, 2008, 7:05:24 PM
Near the end of a lecture on the history of binary translation, available at

http://www.uwtv.org/programs/displayevent.aspx?rID=27625&fID=4946

Dave Ditzel gave a formula, and showed a chart resulting from that
formula, indicating that power would limit the number of cores in a
multi-core processor to *far* fewer than a Moore's law extrapolation
would predict, and that this effect would start pretty soon. Of course,
his solution is binary translation to reduce power per core, but I have
a more basic question: are his formula and the attendant assumptions
reasonable? If so, is there some solution lurking out there
besides his?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Bernd Paysan

Dec 28, 2008, 8:18:49 PM
Stephen Fuld wrote:
> Dave Ditzel gave a formula and showed a chart resulting from that
> formula that indicated that power would limit the number of cores in a
> multi-core processor to *far* less than a Moore's law extrapolation
> would predict, and start that effect pretty soon. Of course, his
> solution is binary translation to reduce power per core, but I have a
> more basic question. Is his formula and the attendant assumptions
> reasonable? If so, then is there some solution lurking out there
> besides his?

Oh, we know that Moore's law drives power consumption up when area stays
constant, and we keep the area of CPUs constant by increasing the number of
cores. However, the assumption is that the architecture stays the same,
which is certainly wrong. Nobody in his right mind would make a quad-core
Pentium 4 processor. So the solution is: use less power-hungry
microarchitectures. That's a no-brainer. Power-hungry microarchitectures
are from the single-core past. And when you look at the power consumption
of an Athlon X2 and a Phenom, you see that most of the effort for the Phenom
apparently went into reducing power consumption per core.

Dave Ditzel says that binary translation will give you a less power-hungry
microarchitecture, i.e. you can still use your old binaries and move to a less
power-hungry architecture. Such architectures exist; look at the GPUs:
teraflops within a power budget that gives you hardly 100 gigaflops (SP)
with an x86-style CPU. Now Intel claims that it's no big problem to make a
GPU with x86 compatibility, but the main point is that whether you program
Rx00 from AMD or Larrabee from Intel, it's not going to be a very
conventional program anyway. You won't write it in assembler, either;
probably in OpenCL or something like that (OpenCL certainly also is "binary
translation", but from source to ISA. Apple's LLVM, which will be used in
Mac OS X's OpenCL implementation, certainly combines the capabilities of a
compiler and a binary translator). And that covers only the performance-hungry
core of your application.

My opinion is that an exposed ISA is better than a binary translation
target, and that binary translation is only necessary for legacy programs,
which don't work on many-core architectures anyway. So why bother?

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Thomas Womack

Dec 29, 2008, 5:07:46 AM
In article <p4ll26-...@vimes.paysan.nom>,
Bernd Paysan <bernd....@gmx.de> wrote:

>My opinion is that an exposed ISA is better than a binary translation
>target, and that binary translation is only necessary for legacy programs,
>which don't work on many-core architectures anyway. So why bother?

I get the impression that a lot of the fast progress in GPU
performance over the last decade was because the GPU sat behind a
driver layer with non-trivial software in it which straddled the
compiler / binary-translator boundary, and so could have an unexposed
ISA which could be changed with each generation. The GPU
many-small-cores model would be badly harmed if each core had to have
a P6-class hardware binary translator to convert DirectX 7 shaders
into what the underlying hardware spoke.

It's the same issue with FPGAs: it's annoying that the architecture
changes every eighteen months (particularly since that means that the
architecture is only really accessible to extremely proprietary
compilers), but the Xilinx 6216 with its open architecture (and open
compilers) was never very successful and died off quite quickly. If
you can get better performance by allowing the fab to provide you with
a very non-uniform collection of long-range routing resources which
can only really be used by a genetic optimiser in the compiler chain,
people go for that rather than slower clear uniformity.

There is a factor of ~1.5 or so between compiling with gcc and with the
Intel compiler, which co-evolves to some extent with their CPUs, but
that seems to be a hit people are more prepared to take.

Tom


Bernd Paysan

Dec 29, 2008, 7:22:04 AM
Thomas Womack wrote:

> In article <p4ll26-...@vimes.paysan.nom>,
> Bernd Paysan <bernd....@gmx.de> wrote:
>
>>My opinion is that an exposed ISA is better than a binary translation
>>target, and that binary translation is only necessary for legacy programs,
>>which don't work on many-core architectures anyway. So why bother?
>
> I get the impression that a lot of the fast progress in GPU
> performance over the last decade was because the GPU sat behind a
> driver layer with non-trivial software in it which straddled the
> compiler / binary-translater boundary, and so could have an unexposed
> ISA which could be changed with each generation.

With "exposed ISA" I mean an ISA exposed to the compiler and to the driver
writer. That way, most of the performance can be used, because people know
what they target. Most people don't write directly for the hardware (but
use a higher level language), and there indeed, it's necessary to not
expose the ISA to them. I.e. don't distribute binaries, distribute some
sort of "source code".

> There is a factor ~1.5 or so between compiling with gcc and with the
> Intel compiler which coevolves to some extent with their CPUs, but
> that seems a hit people are more prepared to take.

But that's not only due to GCC's more generic and portable approach, but
also due to the accumulated cruft in this compiler. After 20 years, it
seemed to become unmaintainable.

Boon

Dec 29, 2008, 11:24:44 AM
Bernd Paysan wrote:

> Apple's LLVM, which will be used in Mac OS X's OpenCL implementation

I was not aware that LLVM had become Apple's. When did this happen? :-)

http://www.appleinsider.com/articles/08/06/20/apples_other_open_secret_the_llvm_complier.html

(The LLVM project started in 2000 at the University of Illinois at
Urbana-Champaign.)

Regards.

Stephen Fuld

Dec 29, 2008, 11:29:25 AM
Bernd Paysan wrote:
> Stephen Fuld wrote:
>> Dave Ditzel gave a formula and showed a chart resulting from that
>> formula that indicated that power would limit the number of cores in a
>> multi-core processor to *far* less than a Moore's law extrapolation
>> would predict, and start that effect pretty soon. Of course, his
>> solution is binary translation to reduce power per core, but I have a
>> more basic question. Is his formula and the attendant assumptions
>> reasonable? If so, then is there some solution lurking out there
>> besides his?
>
> Oh, we know that Moore's law makes power consumption up when area stays
> constant; and we keep area of CPUs constant by increasing the number of
> cores. However, the assumption is that the architecture stays the same,
> which is certainly wrong. Nobody in his right mind would make a quad-core
> Pentium 4 processor. So the solution is: Use less power-hungry
> microarchitectures. That's a no-brainer. Power-hungry microarchitectures
> are from the single-core past. And when you look at the power consumption
> of an Athlon X2 and a Phenom, you see that most effort for the Phenom
> apparently went into reducing said power consumption per core.

That all makes sense, but surely there is a limit to how much single
core performance a customer is willing to give up to get more cores.
Don't you reach a point where reducing the gate count starts to really
hurt performance?

> Dave Ditzel says that binary translation will give a less power hungry
> microarchitecture. I.e. you can still use your old binary, and go to a less
> power hungry architecture.

Yes, that is his position.

> Such architectures exist, look at the GPUs:
> teraflops with the power budget that gives you hardly a 100 gigaflops (SP)
> with an x86-style CPU. Now Intel claims that it's no big problem to make a
> GPU with x86 compatibility,

I'm not sure they are claiming that exactly :-) I think the utility of
Larrabee is yet to be proved.

> but the main point is that whether you program
> Rx00 from AMD or Larrabee from Intel, it's not going to be a very
> conventional program, anyway. You won't write it in assembler, either;
> OpenCL or something like that probably (OpenCL certainly also is "binary
> translation", but from source to ISA. Apple's LLVM, which will be used in
> Mac OS X's OpenCL implementation, certainly applies all possibilities of a
> compiler and a binary translator). And it's only the performance-hungry
> core of your application.

I'll have to spend some time studying OpenCL. Thanks for the reference.

> My opinion is that an exposed ISA is better than a binary translation
> target, and that binary translation is only necessary for legacy programs,
> which don't work on many-core architectures anyway. So why bother?

Well, since the companies that are producing the largest number of
multi-core processors (or at least the ones that project much larger
numbers of cores) are also the ones committed to legacy (i.e. x86)
support, there seems to be a reason. There is a huge amount of
legacy code that a company can't just assume will get recompiled.

Bernd Paysan

Dec 29, 2008, 11:40:54 AM
Stephen Fuld wrote:

> Well, since the companies that are producing the largest number of
> multi-core processors (or at least the ones that project much larger
> number of cores) are also the ones committed to legacy (i.e.
> X86)support, there seems to be a reason.  There is a huge amount of
> legacy code that a company can't just assume will get recompiled.

But no matter what you do, these legacy programs won't benefit from these
many-core architectures anyway. They are single-threaded Windows programs!
The only way they can use the many-core GPGPUs is via DirectX.

MitchAlsup

Dec 29, 2008, 6:02:02 PM

A couple of years ago, I wrote a long essay on multi-processors on a
single chip in this newsgroup.

A couple of points to reiterate:
A) one can build 1/2 the performance of the great-big-cores in
1/12th-1/15th of the area of the great-big-cores
B) one can build 1/2 the performance of the great-big-cores in
1/10th-1/30th of the power of the great-big-cores*
C) the limitation on the growth of on-die cores is total available
memory bandwidth
D) ultimately it will be pin bandwidth that limits the number of
cores on a die

We have been living in a world where it was acceptable to get the
one-big-node and we were willing to pay the price {$$$, power, die-
area}. We are now entering a different world, brought forth by the
memory wall, the power wall, and the synchronization wall. When these
are all fully "in play", the optimal microarchitecture will no longer be of
the great-big-core variety.

Mitch
(*) with potential for 1/100 the power consumption of the great-big-
cores

MitchAlsup

Dec 29, 2008, 7:11:27 PM
On Dec 28, 6:05 pm, Stephen Fuld <S.F...@PleaseRemove.att.net> wrote:
> Dave Ditzel gave a formula and showed a chart resulting from that
> formula that indicated that power would limit the number of cores in a
> multi-core processor to *far* less than a Moore's law extrapolation
> would predict, and start that effect pretty soon.  Of course, his
> solution is binary translation to reduce power per core, but I have a
> more basic question.  Is his formula and the attendant assumptions
> reasonable?  If so, then is there some solution lurking out there
> besides his?

To first order, his equations of:
A) 0.9**generation for voltage is reasonable
B) 0.8**generation for nodal capacitance is reasonable
C) 1.2**generation for frequency is probably not reasonable

On C): if we look back over the last 18 months, we have seen 2 new
generations from 2 companies and have not seen ANY frequency
increases.

But notice how his spreadsheet example changes if you suddenly shed
90% of your power and only lose 50% of your performance, as
illustrated in my previous note.
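
To make that concrete, here is a back-of-the-envelope Python sketch of how
those three factors interact with a fixed package power budget (the fixed
budget and the doubling of the potential core count each generation are
assumptions on my part; the scaling constants are just the ones quoted above):

# Dynamic power per core ~ C * V^2 * f, with the per-generation scaling
# factors listed above; the package power budget is held fixed while
# Moore's law doubles the number of cores that area alone would allow.
V_SCALE, C_SCALE, F_SCALE = 0.9, 0.8, 1.2

def power_per_core(gen):
    """Relative dynamic power of one core after `gen` generations."""
    return (C_SCALE ** gen) * (V_SCALE ** gen) ** 2 * (F_SCALE ** gen)

for gen in range(6):
    moore_cores = 2 ** gen                    # what area alone would allow
    budget_cores = 1.0 / power_per_core(gen)  # what a fixed power budget allows
    print(gen, moore_cores, round(budget_cores, 1))
# By generation 5 the gap is roughly 32 cores vs. ~3.5 cores.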

Mitch

Stephen Fuld

Dec 30, 2008, 11:25:56 AM


I take from this and your previous post that not only is the single core
performance not going to improve, it is actually going to decline as
time goes on. That is, in order to make room in the die and power
budget for more cores the performance of each core will be less than
that of the cores in the previous generation. Or, yet a third
restatement, we are going to actually sacrifice single core performance
for more cores per die. Is this a valid conclusion from your statements?

I note that with the latest Intel generations, microarchitecture
improvements seem to have made at least small improvements in single-core
performance in each generation, but barring any real breakthroughs,
that seems to be a diminishing-returns game.

Dave Ditzel seems to believe that you can use the additional cores to
run an updated version of code-morphing software that would actually get
the morphed code to perform better than it would on a "traditional" x86
core. Given Transmeta's record, I am dubious, but perhaps.

I accept that we will no longer see dramatic improvements in single core
performance, but if it actually declines in future generations, I am
fearful of the results in marketing terms.

Also, I very much take your point about pin/memory limitations. AFAICT,
the big advantage of a separate graphics chip is that it gives you a
lot more pins to use for the memory interface, or to put it another way, it
takes a lot of the video output memory traffic off the main memory.
With integrated graphics, this goes away.

Anton Ertl

Dec 30, 2008, 2:21:03 PM
Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
>I take from this and your previous post that not only is the single core
>performance not going to improve, it is actually going to decline as
>time goes on. That is, in order to make room in the die and power
>budget for more cores the performance of each core will be less than
>that of the cores in the previous generation.

If we have CPUs with a uniform set of cores, and go for manycore.
OTOH, Intel could put one or two high-performance cores in the
package, to make legacy code run satisfactorily, and also put, say 16
Atom-style cores in the package for applications that use many threads
well. To see if this flies in the market, they could even do this as
an MCM: One Nehalem-style chip and one multi-Atom chip, connected via
Hypertransport (sorry, don't remember Intel's name for that), each
talking to some memory.

Of course with CPUs that integrate GPUs, we will see something similar
to this from Intel anyway, but the graphics cores are not quite
equivalent to CPU cores as seen by software, so that's an issue.

The problem with such a proposal is that the OS schedulers are not
very good in my experience even now; they will take their time to
adapt to heterogeneous-performance cores, and after that time they
will still get it wrong.

>Also, I very much take your point about pin/memory limitations. AFAICT,
>the big advantage of a separate graphics chip is that it it gives you a
>lot more pins to use for memory interface, or to put it another way, it
>takes a lot of the video output memory traffic off the main memory.

Video output memory traffic is a relatively small part of memory
bandwidth for modern graphics chips: Even 2560x1600x32@60Hz only needs
1GB/s bandwidth. You don't need memory interfaces with 50GB/s and
more for that. AFAIK reading textures and geometry data, and reading
and writing to the Z-Buffer consume a lot of traffic.
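
For reference, a quick Python check of that figure (plain scan-out only,
ignoring blanking and any overhead):

# Scan-out bandwidth for 2560x1600, 32 bpp, 60 Hz
width, height, bytes_per_pixel, refresh_hz = 2560, 1600, 4, 60
print(width * height * bytes_per_pixel * refresh_hz / 1e9, "GB/s")  # ~0.98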

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Stephen Fuld

Dec 30, 2008, 4:49:09 PM
Anton Ertl wrote:
> Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
>> I take from this and your previous post that not only is the single core
>> performance not going to improve, it is actually going to decline as
>> time goes on. That is, in order to make room in the die and power
>> budget for more cores the performance of each core will be less than
>> that of the cores in the previous generation.
>
> If we have CPUs with a uniform set of cores, and go for manycore.
> OTOH, Intel could put one or two high-performance cores in the
> package, to make legacy code run satisfactorily, and also put, say 16
> Atom-style cores in the package for applications that use many threads
> well. To see if this flies in the market, they could even do this as
> an MCM: One Nehalem-style chip and one multi-Atom chip, connected via
> Hypertransport (sorry, don't remember Intel's name for that), each
> talking to some memory.
>
> Of course with CPUs that integrate GPUs, we will see something similar
> to this from Intel anyway, but the graphics cores are not quite
> equivalent to CPU cores as seen by software, so that's an issue.
>
> The problem with such a proposal is that the OS schedulers are not
> very good in my experience even now, they will take their time to
> adapt to heterogeneous-performance cores, and after that time they
> will still get it wrong.

Good points. Intel could certainly do as you suggest, :-) and
certainly the OS people will have a hard time dealing with it. :-(

>> Also, I very much take your point about pin/memory limitations. AFAICT,
>> the big advantage of a separate graphics chip is that it it gives you a
>> lot more pins to use for memory interface, or to put it another way, it
>> takes a lot of the video output memory traffic off the main memory.
>
> Video output memory traffic is a relatively small part of memory
> bandwidth for modern graphics chips: Even 2560x1600x32@60Hz only needs
> 1GB/s bandwidth.

I'm not talking about the video output, but all the traffic that goes
to/from the high speed, dedicated memory on current graphics cards.
Presumably, on a system where the graphics core is integrated, that
traffic must compete with the rest of the traffic generated by the
other, "regular" CPUs.

MitchAlsup

Dec 30, 2008, 5:34:43 PM
On Dec 30, 10:25 am, Stephen Fuld <S.F...@PleaseRemove.att.net> wrote:
> I take from this and your previous post that not only is the single core
> performance not going to improve, it is actually going to decline as
> time goes on.  That is, in order to make room in the die and power
> budget for more cores the performance of each core will be less than
> that of the cores in the previous generation.  Or, yet a third
> restatement, we are going to actually sacrifice single core performance
> for more cores per die.  Is this a valid conclusion from your statements?

Only after the cost of synchronization is fully addressed can single-core
performance be lowered while still achieving a useful gain in overall
performance; until then, such microarchitectures are fish out of water.

Mitch

Boon

Dec 30, 2008, 5:54:42 PM
Anton Ertl wrote:

> [...] they could even do this as an MCM: One Nehalem-style chip and
> one multi-Atom chip, connected via Hypertransport (sorry, don't
> remember Intel's name for that)

http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect

"QuickPath Interconnect" or "QuickPath" or "QPI"

Regards.

Gavin Scott

Dec 30, 2008, 6:03:34 PM
Stephen Fuld <S.F...@pleaseremove.att.net> wrote:
> I take from this and your previous post that not only is the single core
> performance not going to improve, it is actually going to decline as
> time goes on. That is, in order to make room in the die and power
> budget for more cores the performance of each core will be less than
> that of the cores in the previous generation.

Well, Intel does some rather clever things in their new Core i7
processors. There are four cores, but those that are inactive can
be shut down to a nearly zero-power state, and when not all cores
are in use the remaining core(s) are dynamically *overclocked* past
the normal rated speed of the chip in one or two steps of 133 MHz,
allowing the power and cooling budget to be apportioned as needed.

So a single-threaded application may actually get one core at, say,
3,266 MHz where a multi-threaded app would see four cores at 3,000 MHz
on a "3GHz" part.

Summary and links to details at:

http://en.wikipedia.org/wiki/Intel_Core_i7
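
The turbo arithmetic is simple enough to check directly (the 133 MHz step
and the 3 GHz base are just the figures quoted above):

base_mhz, step_mhz = 3000, 133
for steps in (0, 1, 2):
    print(steps, "turbo steps ->", base_mhz + steps * step_mhz, "MHz")
# 2 steps -> 3266 MHz, matching the single-threaded case described above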

G.

Anton Ertl

Dec 31, 2008, 6:03:37 AM

Ok.

Yes, one advantage is that the separate graphics chip has additional
pins. A probably bigger advantage is that the graphics chip talks to
a known number of soldered RAM devices, allowing much higher clock
frequencies in the interface.

>Presumably, on a system where the graphics core is integrated, that
>traffic must compete with the rest of the traffic generated by the
>other, "regular" CPUs.

Yes, it will be interesting to see what memory configurations these
systems will support and have. Currently some chipset graphics
support separate graphics memory, but the motherboard manufacturers
don't use this option. Apparently those people who would pay extra
for that prefer to buy separate graphics cards.

If this trend continues, the CPU-integrated graphics will also be a
low-end solution, and we will not see fast extra graphics memory for
these CPUs. Of course, if we get fast extra memory that's also
efficiently accessible by the CPU cores, things will get interesting.

Terje Mathisen

Dec 31, 2008, 8:57:07 AM
Stephen Fuld wrote:
> that of the cores in the previous generation. Or, yet a third
> restatement, we are going to actually sacrifice single core performance
> for more cores per die. Is this a valid conclusion from your statements?

I think this is pretty obvious, with one caveat:

There is nothing stopping Intel from making hybrid chips, with maybe 1-4
Core2 class "fast" cores and 30-60 LRB-style
power-efficient/throughput-optimized cores.


>
> I note that with the latest Intel generations, micro architecture
> improvements have seemed to make at least small improvements in single
> core performance in each generation, but barring any real breakthroughs,
> that seems to be a diminishing returns game.

Right.


>
> Dave Ditzel seems to believe that you can use the additional cores to
> run an updates version of code morphing software that would actually get
> the morphed code to perform better than it would on a "traditional" X86
> core. Given Transmeta's record, I am dubious, but perhaps.

I believe this might be a case of "if your only tool is a hammer, then
all problems look like nails": Dave D is probably a lot smarter than
me, but it seems that he's been working on JIT-style code morphing for
so long that it is hard to accept that it might not be the final answer
to everything.

>
> I accept that we will no longer see dramatic improvements in single core
> performance, but if it actually declines in future generations, I am
> fearful of the results in marketing terms.

Which is why I expect at least one or two fast cores, that will spend
almost all the time in a totally idle state, except when running cpu
benchmarks.

(Most single-user CPUs are of course equally idle 99% of the time
anyway. :-)


>
> Also, I very much take your point about pin/memory limitations. AFAICT,
> the big advantage of a separate graphics chip is that it it gives you a
> lot more pins to use for memory interface, or to put it another way, it
> takes a lot of the video output memory traffic off the main memory. With
> integrated graphics, this goes away.

This is the real biggie:

The screen buffer traffic problem will only go away when the entire
video memory, including multiple frame buffers, z-buffers, texture
caches, etc., fits inside the multicore CPU.

The size of the screen working set is still growing, but I believe it
grows more slowly than the total cache size that can fit on a single CPU, and
95% or more of all users will be quite happy with 60 Hz full HD
playback, or playing games at similar resolution and frame rates.

The first-generation LRB chips are supposed to be similar to Cell, with
maybe 256 KB RAM on each core: With 32 cores that is a total of 8 MB.

Another few powers of two, and the working set of most high-end games
will indeed fit nicely.
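
As a rough cross-check of that (the 256 KB/core and 32-core figures are the
assumptions stated above; the frame-buffer size is plain arithmetic):

cores, ram_per_core = 32, 256 * 1024           # first-generation LRB assumption
on_die = cores * ram_per_core                  # 8 MiB total on-die RAM
frame_1080p_32bpp = 1920 * 1080 * 4            # one full-HD colour buffer
print(on_die / 2**20, "MiB on die vs", frame_1080p_32bpp / 1e6, "MB per frame buffer")
# A couple more doublings of on-die RAM and several such buffers fit.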

Running HPC processing tasks, with lots of cross-core communication and
working sets well above 256K per core will be _much_ more difficult to
do well.

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Stephen Fuld

Dec 31, 2008, 11:45:30 AM
Anton Ertl wrote:
> Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
>> Anton Ertl wrote:
>>> Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
>>>> Also, I very much take your point about pin/memory limitations. AFAICT,
>>>> the big advantage of a separate graphics chip is that it it gives you a
>>>> lot more pins to use for memory interface, or to put it another way, it
>>>> takes a lot of the video output memory traffic off the main memory.
>>> Video output memory traffic is a relatively small part of memory
>>> bandwidth for modern graphics chips: Even 2560x1600x32@60Hz only needs
>>> 1GB/s bandwidth.
>> I'm not talking about the video output, but all the traffic that goes
>> to/from the high speed, dedicated memory on current graphics cards.
>
> Ok.
>
> Yes, one advantage is that the separate graphics chip has additional
> pins. A probably bigger advantage is that the graphics chip talks to
> a known number of soldered RAM devices, allowing much higher clock
> frequencies in the interface.

Good point.

>> Presumably, on a system where the graphics core is integrated, that
>> traffic must compete with the rest of the traffic generated by the
>> other, "regular" CPUs.
>
> Yes, it will be interesting to see what memory configurations these
> systems will support and have. Currently some chipset graphics
> support separate graphics memory, but the motherboard manufacturers
> don't use this option. Apparently those people who would pay extra
> for that prefer to buy separate graphics cards.

Right. I would guess that the cost differential between chipset
graphics plus dedicated memory versus a separate graphics card is too
little to make saving the separate graphics chip and connectors, etc.
worthwhile. Also, I would guess that if you have the pins, you are
better off using them for another interface to general-purpose memory,
thus improving overall performance, rather than dedicating them to just
improving graphics.

> If this trend continues, the CPU-integrated graphics will also be a
> low-end solution, and we will not see fast extra graphics memory for
> these CPUs.

Yup.

> Of course, if we get fast extra memory that's also
> efficiently accessible by the CPU cores, things will get interesting.

Yes. Lots of possibilities. Note that Intel's latest CPU finally bit
the bullet and eliminated the MCH by having the memory talk directly to
the CPU. They also now have support for three memory channels. It might
be interesting to allow one of these to run at a faster rate given a
known number of soldered-on chips. You would then have a sort of NUMA
system. I'm not sure this is an improvement :-(

Stephen Fuld

Dec 31, 2008, 11:54:00 AM
Terje Mathisen wrote:
> Stephen Fuld wrote:
>> that of the cores in the previous generation. Or, yet a third
>> restatement, we are going to actually sacrifice single core
>> performance for more cores per die. Is this a valid conclusion from
>> your statements?
>
> I think this is pretty obvious, with one caveat:
>
> There is nothing stopping Intel from making hybrid chips, with maybe 1-4
> Core2 class "fast" cores and 30-60 LRB-style
> power-efficient/throughput-optimized cores.

Yes, but as Anton pointed out, the OSs will take time to figure out how
to deal with this well.

>> I note that with the latest Intel generations, micro architecture
>> improvements have seemed to make at least small improvements in single
>> core performance in each generation, but barring any real
>> breakthroughs, that seems to be a diminishing returns game.
>
> Right.
>>
>> Dave Ditzel seems to believe that you can use the additional cores to
>> run an updates version of code morphing software that would actually
>> get the morphed code to perform better than it would on a
>> "traditional" X86 core. Given Transmeta's record, I am dubious, but
>> perhaps.
>
> I believe this might be a case of "if your only tool is a hammer, then
> all problems looks like nails": Dave D is probably a lot smarter than
> me, but it seems that he's been working on JIT-style code mor.phing for
> so long that it is hard to accept that it might not be the final answer
> to everything.

I suspect you are right.

>> I accept that we will no longer see dramatic improvements in single
>> core performance, but if it actually declines in future generations, I
>> am fearful of the results in marketing terms.
>
> Which is why I expect at least one or two fast cores, that will spend
> almost all the time in a totally idle state, except when running cpu
> benchmarks.
>
> (Most single-user CPUs are of course equally idle 99% of the time
> anyway. :-)
>>
>> Also, I very much take your point about pin/memory limitations.
>> AFAICT, the big advantage of a separate graphics chip is that it it
>> gives you a lot more pins to use for memory interface, or to put it
>> another way, it takes a lot of the video output memory traffic off the
>> main memory. With integrated graphics, this goes away.
>
> This is the real biggie:
>
> The screen buffer traffic problem will only go away when the entire
> video memory, including multiple frame buffers, z-buffers, texture
> caches etc all fit inside the multicore cpu:
>
> The size of the screen working set is still growing, but I believe it
> grows slower than the total cache size that can fit on a single cpu, and
> 95% or more of all users will be quite happy with 60 Hz full HD
> playback, or playing games at similar resolution and frame rates.

Yes, but I suspect it will still be some time before all graphics memory
fits within the multi-core chip. Remember that this memory is competing
with increased cache size for the "regular" CPUs. And while increasing
cache is a diminishing-returns game, it still does provide some benefit.


> The first-generation LRB chips are supposed to be similar to Cell, with
> maybe 256 KB RAM on each core: With 32 cores that is a total of 8 MB.
>
> Another few powers of two, and the working set of most high-end games
> will indeed fit nicely.

Interesting thought.

> Running HPC processing tasks, with lots of cross-core communication and
> working sets well above 256K per core will be _much_ more difficult to
> do well.

Agreed.

Kai Harrekilde-Petersen

Dec 31, 2008, 1:45:48 PM
Stephen Fuld <S.F...@PleaseRemove.att.net> writes:

> Anton Ertl wrote:
>> If this trend continues, the CPU-integrated graphics will also be a
>> low-end solution, and we will not see fast extra graphics memory for
>> these CPUs.
>
> Yup.

Yet, it seems to me that these "low-end" solutions would comfortably
cater for everyone but the gamers (and possibly also the photo/video
production people).

That should be a sizable market, should Intel/AMD be able to convince
the users that they don't need the separate graphics card.


Kai, happily typing away on a box with built-in graphics
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Bernd Paysan

Dec 31, 2008, 7:04:59 PM
Kai Harrekilde-Petersen wrote:
> Yet, it seems to me that these "low-end" solutions would comfortably
> cater for everyone but the gamers (and possibly also the photo/video
> production people).
>
> That should be a sizable market, should Intel/AMD be able to convince
> the users that they don't need the separate graphics card.

AMD already has a solution for everyone but the gamers: their
chipset-integrated graphics is quite good. The only point in moving the GPU
into the CPU die is to save even more cost, i.e. get rid of HyperTransport,
add a PCI-e controller directly to the CPU, include the small GPU, SATA,
Ethernet, and USB controllers, and then solder the chip directly onto the
mainboard.

And BTW: we have had this sort of integrated x86 controller for decades. Most of
the time, they were too low-end to be viable. But in the age of netbooks,
who knows? There's now at least a sufficiently large market segment for
this kind of hardware.

Anton Ertl

Jan 1, 2009, 10:31:44 AM
Bernd Paysan <bernd....@gmx.de> writes:
>Kai Harrekilde-Petersen wrote:
>> Yet, it seems to me that these "low-end" solutions would comfortably
>> cater for everyone but the gamers (and possibly also the photo/video
>> production people).
>>
>> That should be a sizable market, should Intel/AMD be able to convince
>> the users that they don't need the separate graphics card.
>
>AMD already has a solution for everyone but the gamers: Their chipset
>integrated graphics is quite good.

Concerning Photo/Video, no discrete graphics card is needed for
photos, and for video the only hardware support in graphics chips that
I know of is decoding support, and some integrated graphics solutions
can do that, too.

Concerning "everyone but the gamers", I see at least two other markets:

* Server boards typically have a separate low-end graphics chip
on-board, to avoid wasting main memory bandwidth for video output,
i.e., to maximize CPU performance. Likewise, I guess that
workstations will typically use a discrete graphics card for the same
reason even if only low or no 3D performance is required.

* People who use a 2560x1600 display (ok, photo production people
might be in that category). Chipset graphics typically don't support
dual-link DVI (or at least I guess so, I have a hard time finding that
in the specs), so either one needs to buy a discrete graphics card, a
display with a DisplayPort (there are some motherboards with
DisplayPort; hopefully they support 2560x1600), or a Display-Port to
Dual-Link DVI adapter (which does not seem to exist yet). The
discrete graphics card is probably the cheapest of these options.

Kai Harrekilde-Petersen

Jan 1, 2009, 11:51:12 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Bernd Paysan <bernd....@gmx.de> writes:
>>Kai Harrekilde-Petersen wrote:
>>> Yet, it seems to me that these "low-end" solutions would comfortably
>>> cater for everyone but the gamers (and possibly also the photo/video
>>> production people).
>>>
>>> That should be a sizable market, should Intel/AMD be able to convince
>>> the users that they don't need the separate graphics card.
>>
>>AMD already has a solution for everyone but the gamers: Their chipset
>>integrated graphics is quite good.
>
> Concerning Photo/Video, no discrete graphics card is needed for
> photos, and for video the only hardware support in graphics chips that
> I know of is decoding support, and some integrated graphics solutions
> can do that, too.
>
> Concerning "everyone but the gamers", I see at least two other markets:
>
> * Server boards typically have a separate low-end graphics chip
> on-board, to avoid wasting main memory bandwidth for video output,
> i.e., to maximize CPU performance.

Would that really be necessary, from a performance point of view?
I mean, would they even *bother*, considering cost, power, etc?

> Likewise, I guess that
> workstations will typically use a discrete graphics card for the same
> reason even if only low or no 3D performance is required.

The workstations might fall under your dual-DVI category. We use a
dual-screen setup extensively in our development group. Current setup
is dual 1600x1200, future setup will be dual 1920x1200 (RTL waveforms
and IC layout eat pixels for breakfast).

Hmmm ... perhaps the photo/video segment would actually fall into the
dual-screen category as well?


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Anton Ertl

Jan 1, 2009, 12:10:08 PM
Kai Harrekilde-Petersen <k...@harrekilde.dk> writes:

>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> Concerning "everyone but the gamers", I see at least two other markets:
>>
>> * Server boards typically have a separate low-end graphics chip
>> on-board, to avoid wasting main memory bandwidth for video output,
>> i.e., to maximize CPU performance.
>
>Would that really be necessary, from a performance point of view?
>I mean, would they even *bother*, considering cost, power, etc?

Chipset manufacturers don't build graphics into chipsets intended for
servers, and motherboard manufacturers don't produce server
motherboards with chipsets that integrate graphics. Both probably
think that their customers are prepared to pay the few extra bucks
necessary to avoid that potential performance issue. If the board
costs EUR 500, and the whole machine costs EUR 6000, then I guess the
customers won't try to shave EUR 10 from the total by buying a
solution with chipset graphics that consumes CPU memory bandwidth.
The cost and the power consumption of these low-end graphics chips are
pretty low (they don't even have a heat sink, much less a fan).

>> Likewise, I guess that
>> workstations will typically use a discrete graphics card for the same
>> reason even if only low or no 3D performance is required.
>
>The workstations might fall in under your dual-DVI category. We use a
>dual-screen setup extensively in our development group. Current setup
>is dual 1600x1200, future setup will be dual 1920x1200 (RTL waveforms
>and IC layout eat pixels for breakfast).
>
>Hmmm ... perhaps the photo/video segment would actually fall into the
>dual-screen category as well?

Yes, and dual-screen or triple-screen is another case where the
integrated graphics is not enough (I was talking about a single big
screen, which needs a single dual-link DVI connection).

Bernd Paysan

Jan 1, 2009, 12:19:53 PM
Anton Ertl wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
>>AMD already has a solution for everyone but the gamers: Their chipset
>>integrated graphics is quite good.
>
> Concerning Photo/Video, no discrete graphics card is needed for
> photos, and for video the only hardware support in graphics chips that
> I know of is decoding support, and some integrated graphics solutions
> can do that, too.
>
> Concerning "everyone but the gamers", I see at least two other markets:
>
> * Server boards typically have a separate low-end graphics chip
> on-board, to avoid wasting main memory bandwidth for video output,
> i.e., to maximize CPU performance.

But servers usually have a low-res output, e.g. a small 800x600 TFT display
(rack space is limited), and on the server I maintain, that's usually
blanked.

> Likewise, I guess that
> workstations will typically use a discrete graphics card for the same
> reason even if only low or no 3D performance is required.

For workstations, a discrete graphics card makes sense. While the memory
bandwidth is reduced by up to 20% (all DVI links at maximum speed), most
benchmark results nevertheless tell you that the IGP doesn't eat much
performance. I see a 10% improvement in my memory benchmark with "xset
dpms force off".

> * People who use a 2560x1600 display (ok, photo production people
> might be in that category). Chipset graphics typically don't support
> dual-link DVI (or at least I guess so, I have a hard time finding that
> in the specs), so either one needs to buy a discrete graphics card, a
> display with a DisplayPort (there are some motherboards with
> DisplayPort; hopefully they support 2560x1600), or a Display-Port to
> Dual-Link DVI adapter (which does not seem to exist yet). The
> discrete graphics card is probably the cheapest of these options.

AMD's integrated graphics chip (780G, HD 3200) supports DisplayPort; I own
one, but I don't know whether the DVI port on my board is dual-link or not
(I don't need it, and the manual is too shy to tell ;-). The TMDS
interface integrated into the 780G chipset is dual-link capable; whether
that's used to drive the VGA as one link and one link on the DVI-D or not,
I don't know. If I buy a new 30" monitor, it will probably be one with
DisplayPort anyway (since that won't happen anytime soon ;-).

Terje Mathisen

Jan 1, 2009, 1:16:54 PM
Stephen Fuld wrote:

> Terje Mathisen wrote:
>> There is nothing stopping Intel from making hybrid chips, with maybe
>> 1-4 Core2 class "fast" cores and 30-60 LRB-style
>> power-efficient/throughput-optimized cores.
>
> Yes, but as Anton pointed out, the OSs will take time to figure out how
> to deal with this well.

Obviously:

Various bleeding-edge Linux versions and probably *BSD as well will
support such a hybrid chip from day one, while Microsoft's Windows 7 (or
8, 9 etc) will need major surgery and significant lead time.

EricP

Jan 1, 2009, 4:13:13 PM
Terje Mathisen wrote:
>
> The size of the screen working set is still growing, but I believe it
> grows slower than the total cache size that can fit on a single cpu, and
> 95% or more of all users will be quite happy with 60 Hz full HD
> playback, or playing games at similar resolution and frame rates.
>
> The first-generation LRB chips are supposed to be similar to Cell, with
> maybe 256 KB RAM on each core: With 32 cores that is a total of 8 MB.
>
> Another few powers of two, and the working set of most high-end games
> will indeed fit nicely.

How much memory does an H.264 HD decode take?
It looks like it can reference blocks in up to 16 other
frames, and that it takes (I'm not sure) 16 bits per pixel,
so about 66.35 MB for just the frame buffers. Is that about right?

Eric

Niels Jørgen Kruse

Jan 1, 2009, 4:18:51 PM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> * People who use a 2560x1600 display (ok, photo production people
> might be in that category). Chipset graphics typically don't support
> dual-link DVI (or at least I guess so, I have a hard time finding that
> in the specs), so either one needs to buy a discrete graphics card, a
> display with a DisplayPort (there are some motherboards with
> DisplayPort; hopefully they support 2560x1600), or a Display-Port to
> Dual-Link DVI adapter (which does not seem to exist yet). The
> discrete graphics card is probably the cheapest of these options.

Apple recently started shipping a Mini-DisplayPort to Dual-Link DVI
adapter. Early customers are complaining about glitches after hours of
use, so it is not fully baked at the moment.

The MacBook supports 2560x1600 with its integrated graphics. Odds are
good that the next Mac mini will too.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Anton Ertl

Jan 1, 2009, 4:29:53 PM
nos...@ab-katrinedal.dk (Niels Jørgen Kruse) writes:

>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Display-Port to
>> Dual-Link DVI adapter (which does not seem to exist yet).
...

>Apple recently started shipping a Mini-DisplayPort to Dual-Link DVI
>adapter.

Yes, but I guess there is no DP to MiniDP adapter, so this won't help
people with motherboards with a DP. Also, it costs USD 99, more than
a discrete graphics card that supports dual-link DVI.

Anton Ertl

Jan 1, 2009, 4:59:30 PM
Bernd Paysan <bernd....@gmx.de> writes:

>Anton Ertl wrote:
>> * Server boards typically have a separate low-end graphics chip
>> on-board, to avoid wasting main memory bandwidth for video output,
>> i.e., to maximize CPU performance.
>
>But servers usually have a low-res output, e.g. a small 800x600 TFT display
>(rack space is limited), and on the server I maintain, that's usually
>blanked.

Yes, and ours are typically in text mode (where I suspect even lower
bandwidth requirements than for low-res graphics). Still, server
boards have separate graphics, and certainly we liked that about the
boards when we made buying decisions.

>> * People who use a 2560x1600 display (ok, photo production people
>> might be in that category). Chipset graphics typically don't support
>> dual-link DVI (or at least I guess so, I have a hard time finding that
>> in the specs), so either one needs to buy a discrete graphics card, a
>> display with a DisplayPort (there are some motherboards with
>> DisplayPort; hopefully they support 2560x1600), or a Display-Port to
>> Dual-Link DVI adapter (which does not seem to exist yet). The
>> discrete graphics card is probably the cheapest of these options.
>
>AMD's integrated graphic chip (780G, HD 3200) supports display port

Yes, but most 2560x1600 displays don't (especially the cheaper ones).

Boon

Jan 1, 2009, 5:38:56 PM
Terje Mathisen wrote:
> Stephen Fuld wrote:
>> Terje Mathisen wrote:
>>> There is nothing stopping Intel from making hybrid chips, with maybe
>>> 1-4 Core2 class "fast" cores and 30-60 LRB-style
>>> power-efficient/throughput-optimized cores.
>>
>> Yes, but as Anton pointed out, the OSs will take time to figure out
>> how to deal with this well.
>
> Obviously:
>
> Various bleeding-edge Linux versions

By "Linux version" do you mean git trees of the Linux kernel ?

Bernd Paysan

Jan 1, 2009, 6:31:43 PM
Anton Ertl wrote:
>>AMD's integrated graphic chip (780G, HD 3200) supports display port
>
> Yes, but most 2560x1600 displays don't (especially the cheaper ones).

ASUS seems to be a bit more verbose about dual-link capability. Look into the
specifications of these 780G boards here; most are dual-link DVI capable:

http://www.asus.com/products.aspx?l1=3&l2=149&l3=639

Another, more recent board from ASRock, the A780GXE/128M, actually has
dual-link DVI specified, and also comes with 128 MB of dedicated graphics
memory (but no DisplayPort). The dual-link specification comes from a
third-party site; ASRock itself seems incapable of saying what they do. c't
12/2008 also claims that the A780FullDisplayPort I have has a dual-link DVI
(it has a dual-link DVI-D connector, but that by itself says nothing ;-).

Stephen Fuld

Jan 2, 2009, 2:24:05 AM
Terje Mathisen wrote:
> Stephen Fuld wrote:
>> Terje Mathisen wrote:
>>> There is nothing stopping Intel from making hybrid chips, with maybe
>>> 1-4 Core2 class "fast" cores and 30-60 LRB-style
>>> power-efficient/throughput-optimized cores.
>>
>> Yes, but as Anton pointed out, the OSs will take time to figure out
>> how to deal with this well.
>
> Obviously:
>
> Various bleeding-edge Linux versions and probably *BSD as well will
> support such a hybrid chip from day one, while Microsoft's Windows 7 (or
> 8, 9 etc) will need major surgery and significant lead time.


I've been thinking about this a little. If the number of "threads"
(i.e. units of work requesting CPU time, whether they be single-threaded
processes or threads of a multi-threaded process) is <= the number of fast
processors, then it seems trivial. But if the number of requesters is
greater than the number of fast processors, is there a general algorithm
for the OS to decide (without any information from the requesters
themselves) which ones are given time on the fast processors? ISTM this
could get complex very quickly and I don't see an easy solution.

Jean-Marc Bourguet

Jan 2, 2009, 3:12:23 AM
Stephen Fuld <S.F...@PleaseRemove.att.net> writes:

Migrating to the fast processors the processes which use their full time slice
without yielding it back -- and migrating off those which aren't consuming
theirs -- seems an obvious first approach. These are in fact the same processes
which benefit from getting a longer time slice less often.

Yours,

--
Jean-Marc

Jasen Betts

Jan 2, 2009, 5:44:28 AM
On 2009-01-01, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Kai Harrekilde-Petersen <k...@harrekilde.dk> writes:
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> Concerning "everyone but the gamers", I see at least two other markets:
>>>
>>> * Server boards typically have a separate low-end graphics chip
>>> on-board, to avoid wasting main memory bandwidth for video output,
>>> i.e., to maximize CPU performance.
>>
>>Would that really be necessary, from a performance point of view?
>>I mean, would they even *bother*, considering cost, power, etc?
>
> Chipset manufacturers don't build graphics into chipsets intended for
> servers,
> and motherboard manufacturers don't produce server
> motherboards with chipsets that integrate graphics.

Sure they do! I wouldn't want to waste a rare resource (slots)
in a 1U server (1U servers are a whole lot cheaper to rent space for),
especially when no one is going to even connect a display to it unless
something goes seriously wrong with it. It doesn't need to be anything
flash; it'll probably never leave 80x25 (VGA-compatible) text mode.

http://www.asus.co.nz/products.aspx?l1=9&l2=40&l3=116&l4=0&model=1476&modelmenu=2


Jasen Betts

Jan 2, 2009, 5:53:58 AM
On 2009-01-01, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Bernd Paysan <bernd....@gmx.de> writes:
>>Anton Ertl wrote:
>>> * Server boards typically have a separate low-end graphics chip
>>> on-board, to avoid wasting main memory bandwidth for video output,
>>> i.e., to maximize CPU performance.
>>
>>But servers usually have a low-res output, e.g. a small 800x600 TFT display
>>(rack space is limited), and on the server I maintain, that's usually
>>blanked.
>
> Yes, and ours are typically in text mode (where I suspect even lower
> bandwidth requirements than for low-res graphics).

Std VGA text mode: 720x480 pixels (not including overscan), 73 Hz vertical
refresh, 33.5 MHz pixel clock (IIRC); it uses 8 KB of video RAM: a 4 KB display
buffer and a 4 KB font buffer. (Linux uses a 16 KB display buffer with a
sliding window for more efficient scrolling.)

> Still, server boards have separate graphics

Some do, some don't.

Anton Ertl

Jan 2, 2009, 6:11:07 AM
Jasen Betts <ja...@xnet.co.nz> writes:
>On 2009-01-01, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Chipset manufacturers don't build graphics into chipsets intended for
>> servers,
>> and motherboard manufacturers don't produce server
>> motherboards with chipsets that integrate graphics.
>
>sure they do!

Example?

> I wouldn't wouldn't want to waste a rare resource (slots)
>in a 1U server.

That does not mean that the server uses chipset graphics.

>http://www.asus.co.nz/products.aspx?l1=9&l2=40&l3=116&l4=0&model=1476&modelmenu=2

This server does not have chipset graphics. Instead it has a separate
graphics chip (XGI Z7), and, more importantly, separate graphics
memory (32MB DDR).

Anton Ertl

Jan 2, 2009, 6:17:06 AM
Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
>I've been thinking about this a little. If the number of "threads"
>(i.e. units of work requesting CPU time whether they be single thread
>processes or threads of a multi-thread process) is =< the number of fast
>processors, then it seems trivial. But if the number of requesters is
>greater than the number of fast processors, is there a general algorithm
>for the OS to decide (without any information from the requesters
>themselves) which ones are given time on the fast processors?

Maybe schedule processes with a single thread on a fast core and
processes with many threads on several slower cores.

Other things that can be considered:

* nice level.

* proportion of cache misses. Or maybe also IPC.
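
A toy sketch of how such criteria could be combined (the weights, field
names, and the sample tasks are invented for illustration, not taken from
any real scheduler):

# Rank runnable tasks for the few fast cores: prefer un-niced,
# CPU-bound, single-threaded work; everything else goes to slow cores.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    nice: int            # -20 .. 19, lower = higher priority
    slice_used: float    # fraction of recent time slices fully consumed
    threads: int

def fast_core_score(t):
    single_thread_bonus = 1.0 if t.threads == 1 else 0.0
    return t.slice_used + single_thread_bonus - t.nice / 20.0

tasks = [
    Task("browser", nice=0,  slice_used=0.2, threads=8),
    Task("compile", nice=0,  slice_used=1.0, threads=1),
    Task("folding", nice=19, slice_used=1.0, threads=1),  # background soaker
]
fast_cores = 1
ranked = sorted(tasks, key=fast_core_score, reverse=True)
print([t.name for t in ranked[:fast_cores]], "-> fast core(s)")
print([t.name for t in ranked[fast_cores:]], "-> slow cores")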

Jasen Betts

Jan 2, 2009, 6:00:15 AM

The OS already has a way to prioritise tasks; the best solution is
probably to put the tasks with the highest priority on the fastest
"silicon".

Thomas Womack

Jan 2, 2009, 8:15:59 AM
In article <2009Jan...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
>>I've been thinking about this a little. If the number of "threads"
>>(i.e. units of work requesting CPU time whether they be single thread
>>processes or threads of a multi-thread process) is =< the number of fast
>>processors, then it seems trivial. But if the number of requesters is
>>greater than the number of fast processors, is there a general algorithm
>>for the OS to decide (without any information from the requesters
>>themselves) which ones are given time on the fast processors?
>
>Maybe schedule processes with a single thread on a fast core and and
>processes with many threads on several slower cores.
>
>Other things that can be considered:
>
>* nice level.

One of the recent Ubuntus uses the nice level to determine when to turn
CPU speeds down, which makes some sort of sense. Except that I use
nice -19 to run background jobs that I want to run full-speed 24/7 at
all times when I'm not doing anything else, and get scheduled back
when I want to use the CPUs; and running at 66% of full speed for 1.4 times
as long (the effectively faster RAM at the slower speed helps
slightly) is not power-efficient, because the idle power doesn't scale.

Tom

Terje Mathisen

Jan 2, 2009, 8:39:28 AM

H.264 can in theory look pretty much anywhere, but in reality this
"never" happens, at least not with current encoders.

What I've seen is quite similar to regular DVD (not too surprising,
right?) with two reference frames used to interpolate the current frame.

The actual working set is based on the 8-bit luminance data (the color
info is normally sub-sampled) where you need 2 MB for a single
progressive 1920x1080 frame, and another 2+2=4 MB for the two reference
frames, i.e. 6MB.

This is BTW the size of cache in some recent Core2 versions. :-)

OTOH, Blu-Ray is much easier: Sony knew that they had to make it work on
the PS3 with 256 KB per cell core, so BR video is encoded as 4
independent quadrants, reducing the frame buffer to 0.5 MB, and since
current BR disks are 1080i, it fits perfectly.

I've read one paper about an actual Cell-based BR decoder: It seems like
they build the current frame inside the core ram, then globally schedule
the different cores so that they can do the reference frame lookups
using the DMA engine without overcommitting available bandwidth.
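
For reference, the raw numbers behind those estimates (simple arithmetic on
the figures quoted in this subthread):

# Luma-only working set as estimated above, vs. EricP's worst case of
# 16 reference frames at 16 bits per pixel.
W, H = 1920, 1080
luma_frame = W * H                     # ~2.1 MB per progressive frame
typical = 3 * luma_frame               # current frame + 2 reference frames
worst_case = W * H * 2 * 16            # 16 refs, 2 bytes/pixel
print(luma_frame / 1e6, typical / 1e6, worst_case / 1e6)  # ~2.1, ~6.2, ~66.4 MB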

Terje Mathisen

Jan 2, 2009, 8:40:39 AM

Probably there first, then enthusiast kernel builds.

Terje Mathisen

Jan 2, 2009, 8:47:58 AM

The speed problem is relatively easy: threads that don't use a lot of
CPU don't need a fast core (at least not by default).

OTOH, if a thread needs Larrabee-style 16-wide vector math, then it
obviously has to run on such a core, independent of the CPU work/sleep
ratio.

You would do it based on capability bits for each core and requirement
bits for each executable.

The req bits could be dynamic (detected by the unsupported opcode fault
handler) or static (compiled-in).

The key problem is to avoid having all threads migrate to the most
capable core, because they all, sooner or later, call a library function
that requires a specific feature.

A message-passing kernel would be easier to map this way, since you
could more easily keep library code on a different core from the
mainline program code.
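
A toy sketch of that capability/requirement matching (the bit names and the
"least capable eligible core" policy are illustrative inventions, not any
existing kernel's behaviour):

# Each core advertises a capability mask, each thread carries a
# requirement mask; a thread may only run on a core whose capabilities
# cover its requirements, and among those we pick the least capable
# core so threads don't all pile onto the "big" ones.
FAST = 1 << 0        # high single-thread performance
WIDE_VEC = 1 << 1    # e.g. 16-wide vector unit

CORES = [
    ("big0",   FAST),
    ("small0", 0),
    ("small1", 0),
    ("lrb0",   WIDE_VEC),
]

def pick_core(required, idle_cores):
    """Return the least-capable idle core that satisfies `required`."""
    eligible = [(name, caps) for name, caps in idle_cores
                if caps & required == required]
    if not eligible:
        return None                    # wait, or trap and emulate
    return min(eligible, key=lambda c: bin(c[1]).count("1"))

print(pick_core(0, CORES))             # plain thread -> a small core
print(pick_core(WIDE_VEC, CORES))      # vector thread -> the LRB-style core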

Anton Ertl

Jan 2, 2009, 9:13:23 AM
Thomas Womack <two...@chiark.greenend.org.uk> writes:
>One of the recent ubuntus uses nice level to determine when to turn
>CPU speeds down, which makes some sort of sense. Except that I use
>nice -19 to run background jobs that I want to run full-speed 24/7 at
>all times when I'm not doing anything else,

Actually it's the other way around: the ondemand governor does not
consider nice jobs when determining whether to raise the clock rate or
keep it high. You can change that with

for i in /sys/devices/system/cpu/cpu*/cpufreq/ondemand/ignore_nice_load; do echo 0 >$i; done

>running at 66% full speed for 1.4 times
>as long (the effectively- faster RAM at the slower speed helps
>slightly) is not power-efficient because the idle power doesn't scale.

If you load the machine 24/7, then it's more a question of how often
you want to get results. A lower clock will consume less power (but
not per result), whereas a higher clock will give you a higher result
rate. It certainly is not power-efficient to buy more machines and
run them at the lower clock rate.

If you don't have load all the time, it depends on whether you leave
the computer running idle for the rest of the time or turn it off. On
all machines I have measured it's at least as efficient to run it at a
lower speed with less idle time than at the higher speed with more
idle time (contrary to the "race to idle" strategy that some people
advocate).

If you turn the machine off after the job is done, and if the
processes scale decently with the clock frequency, then it's more
efficient to run at the higher speeds.

My data is found on
<http://www.complang.tuwien.ac.at/anton/computer-power-consumption.html>.
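
To make the idle-versus-power-off comparison concrete, here is a tiny
model; the wattages and runtimes below are invented placeholders, not
numbers from that page:

#include <stdio.h>

int main(void)
{
    /* invented example figures -- substitute measured values */
    double p_fast = 120.0, p_slow = 90.0, p_idle = 60.0; /* watts     */
    double t_fast = 1.0,   t_slow = 1.4;                 /* job hours */
    double period = 2.0;                                 /* hours     */

    /* machine stays on (idle) for the rest of the period */
    double e_fast_on = p_fast * t_fast + p_idle * (period - t_fast);
    double e_slow_on = p_slow * t_slow + p_idle * (period - t_slow);

    /* machine is switched off as soon as the job finishes */
    double e_fast_off = p_fast * t_fast;
    double e_slow_off = p_slow * t_slow;

    printf("left idling : fast %.0f Wh, slow %.0f Wh\n", e_fast_on, e_slow_on);
    printf("powered off : fast %.0f Wh, slow %.0f Wh\n", e_fast_off, e_slow_off);
    return 0;
}

With these placeholder numbers the lower clock wins if the box idles
afterwards and the higher clock wins if it is powered down, which is the
pattern described above.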

Terje Mathisen

unread,
Jan 2, 2009, 10:52:11 AM1/2/09
to
Anton Ertl wrote:
> rate. It certainly is not power-efficient to buy more machines and
> run them at the lower clock rate.

That is almost certainly false!

All the recent cpus I've seen power/performance curves for have higher
performance/watt when throttled a bit down from the maximum frequency.

I assume you meant total system power, but that seems to be false as well?

Stephen Fuld

unread,
Jan 2, 2009, 11:12:20 AM1/2/09
to

I think this has some problems. For example, what if more tasks are
compute bound than you have fast processors? Also I guess you could
have a situation where you were running something like protein folding
at home, a background task that soaks up all available CPU time.
A naive algorithm could then schedule it on the fast core when in
fact it doesn't contribute to the higher-priority work the customer
really wants done.

Anton Ertl

unread,
Jan 2, 2009, 11:11:13 AM1/2/09
to
Terje Mathisen <terje.m...@hda.hydro.com> writes:
>Anton Ertl wrote:
>> rate. It certainly is not power-efficient to buy more machines and
>> run them at the lower clock rate.
>
>That is almost certainly false!
>
>All the recent cpus I've seen power/performance curves for have higher
>performance/watt when throttled a bit down from the maximum frequency.
>
>I assume you meant total system power, but that seems to be false as well?

Even if you look only at power consumption during usage, I have yet to
see a machine where total system power rises faster than clock rate.
My data is in
<http://www.complang.tuwien.ac.at/anton/computer-power-consumption.html>

But you have to add in the power needed to build the systems, too, and
that does not change if you clock the machine lower; that shifts the
balance even more in favour of running fewer machines faster.

Jean-Marc Bourguet

unread,
Jan 2, 2009, 11:27:03 AM1/2/09
to
Stephen Fuld <S.F...@PleaseRemove.att.net> writes:

> Jean-Marc Bourguet wrote:

>> Migrate to the fast processors the process which use their full time slice
>> without yielding it back -- and migrate of those which aren't consuming
>> theirs -- seems an obvious first approach. It's in fact the same processes
>> which get benefit from a longer time slice but less often.
>
> I think this has some problems. For example, what if more tasks are
> compute bound than you have fast processors?

Have them take turns on the slow ones.

> Also I guess you could have a situation where you were running something
> like protein folding at home which is a background task that soaks up all
> available CPU time. A naive algorithm could then schedule it on the fast
> core when in fact it doesn't contribute the higher priority work the
> customer really wants done.

Such a process would be niced -- and so the scheduler would have a hint
that it is the one to put preferably on the slow processors.


If your point is that it is difficult to schedule an unknown workload
correctly when it nearly saturates the available resources, and that the
problem is worse if the resources aren't homogeneous, we totally agree.

Yours,

--
Jean-Marc

Stephen Fuld

unread,
Jan 2, 2009, 12:59:45 PM1/2/09
to
Jean-Marc Bourguet wrote:
> Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
>
>> Jean-Marc Bourguet wrote:
>
>>> Migrate to the fast processors the process which use their full time slice
>>> without yielding it back -- and migrate of those which aren't consuming
>>> theirs -- seems an obvious first approach. It's in fact the same processes
>>> which get benefit from a longer time slice but less often.
>> I think this has some problems. For example, what if more tasks are
>> compute bound than you have fast processors?
>
> Have them having turn on the slow ones.

Yes, that would work. But note that you would have to keep the rate of
"turn taking" rather slow to minimize the cost of process migration.

>> Also I guess you could have a situation where you were running something
>> like protein folding at home which is a background task that soaks up all
>> available CPU time. A naive algorithm could then schedule it on the fast
>> core when in fact it doesn't contribute the higher priority work the
>> customer really wants done.
>
> Such process would be niced -- and so the scheduler would have an hint that
> it is the one to put preferably on the slow processors.
>
>
> If your point is that it is difficult to schedule correctly an unknown
> workload which nearly saturate the available ressources, and that the
> problem is worse if the ressources aren't homogenous, we totally agree.

Yes, I agree that is the heart of it. From the replies, there are a lot
of heuristics and some reliance on information provided by the programs,
but no clean, simple optimal solutions.

EricP

unread,
Jan 2, 2009, 1:01:51 PM1/2/09
to
Jean-Marc Bourguet wrote:
>
> Migrate to the fast processors the process which use their full time slice
> without yielding it back -- and migrate of those which aren't consuming
> theirs -- seems an obvious first approach. It's in fact the same processes
> which get benefit from a longer time slice but less often.

The problem with this is that it tends to put the
long, grinding compute bound tasks on the fastest
cpu and the short, real time tasks on the slow cpu.
So the video playback which contains lots of timer waits
winds up on the slow cpu, and seti@home on the fast one.

You need some concept of work class:
- Interrupts
- Deferred interrupt work (WNT DPC's or Linux BottomHalfs)
- Soft real-time (non time slice) threads by priority
- Application (time sliced) threads by priority

The interrupts, DPCs and real time threads should all be
done as fast as possible. Running the app threads that yield
often on the fastest cpu might make the system more responsive.

Eric

Jasen Betts

unread,
Jan 2, 2009, 3:29:04 PM1/2/09
to
On 2009-01-02, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Jasen Betts <ja...@xnet.co.nz> writes:
>>On 2009-01-01, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Chipset manufacturers don't build graphics into chipsets intended for
>>> servers,
>>> and motherboard manufacturers don't produce server
>>> motherboards with chipsets that integrate graphics.
>>
>>sure they do!
>
> Example?
>
>> I wouldn't wouldn't want to waste a rare resource (slots)
>>in a 1U server.
>
> That does not mean that the server uses chipset graphics.
>
>>http://www.asus.co.nz/products.aspx?l1=9&l2=40&l3=116&l4=0&model=1476&modelmenu=2
>
> This server does not have chipset graphics. Instead it has a separate
> graphics chip (XGI Z7), and, more importantly, separate graphics
> memory (32MB DDR).

Ah, now I see what you mean. Sorry.

Niels Jørgen Kruse

unread,
Jan 2, 2009, 3:50:21 PM1/2/09
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> nos...@ab-katrinedal.dk (=?ISO-8859-1?Q?Niels_J=F8rgen_Kruse?=) writes:
> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> >> Display-Port to
> >> Dual-Link DVI adapter (which does not seem to exist yet).
> ...
> >Apple recently started shipping a Mini-DisplayPort to Dual-Link DVI
> >adapter.
>
> Yes, but I guess there is no DP to MiniDP adapter, so this won't help
> people with motherboards with a DP. Also, it costs USD 99, more than
> a discrete graphics card that supports dual-link DVI.

The existence of the adapter shows that the silicon needed is available,
so somebody else could come out with a DP to dual-link DVI adapter any
time. The difference in connector is trivial.

Jasen Betts

unread,
Jan 2, 2009, 3:51:58 PM1/2/09
to
On 2009-01-02, Stephen Fuld <S.F...@PleaseRemove.att.net> wrote:

>> If your point is that it is difficult to schedule correctly an unknown
>> workload which nearly saturate the available ressources, and that the
>> problem is worse if the ressources aren't homogenous, we totally agree.
>
> Yes, I agree that is the heart of it. From the replies, there are a lot
> of heuristics and some reliance on information provided by the programs,
> but no clean, simple optimal solutions.

There are no simple optimal scheduling solutions; if you had one, you'd
have a simple solution to the halting problem.

Most schedulers assume that history is an indicator of the future, and
this, while sub-optimal, works fairly well.

Jasen Betts

unread,
Jan 2, 2009, 4:44:33 PM1/2/09
to
On 2009-01-02, EricP <ThatWould...@thevillage.com> wrote:
> Jean-Marc Bourguet wrote:
>>
>> Migrate to the fast processors the process which use their full time slice
>> without yielding it back -- and migrate of those which aren't consuming
>> theirs -- seems an obvious first approach. It's in fact the same processes
>> which get benefit from a longer time slice but less often.
>
> The problem with this is that it tends to put the
> long, grinding compute bound tasks on the fastest
> cpu and the short, real time tasks on the slow cpu.
> So the video playback which contains lots of timer waits
> winds up on the slow cpu, and seti@home on the fast one.

If the slow cpu is capable, that seems like an optimal solution.

> You need some concept of work class:
> - Interrupts
> - Deferred interrupt work (WNT DPC's or Linux BottomHalfs)
> - Soft real-time (non time slice) threads by priority
> - Application (time sliced) threads by priority
>
> The interrupts, DPCs and real time threads should all be
> done as fast as possible. Running the app threads that yield
> often on the fastest cpu might make the system more responsive.

A few milliseconds is unlikely to make a large difference.

Morten Reistad

unread,
Jan 5, 2009, 1:27:50 PM1/5/09
to
In article <E1r7l.250857$Mh5.2...@bgtnsc04-news.ops.worldnet.att.net>,

Isn't this what the priority (PR in top, PRI in ps in Linux) field is for?

It is adaptive to cpu usage, and depends on the nice value.

Sort on priority, assign runnable processes to the fastest processors
first, in order of priority, but keep the cpu affinity local for the
millisecond range of time to avoid too much cache-thrashing. This should
fit the bill well, given correct gradients and biases on the priority
from the start. This is more or less what Linux 2.6 does already. You
just need to assign the fast iron first.

This way, a fast process will stay on the fast processor until the
processor loading exceeds a given threshold, and then migrate off
to a slower one.
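
A toy user-space rendition of that policy (the core speeds and priorities
below are made-up inputs; the real kernel works on run queues, not sorted
arrays):

#include <stdio.h>
#include <stdlib.h>

struct core { int id; int bogomips; };          /* faster = larger number */
struct task { const char *name; int prio; };    /* lower = better priority */

static int by_speed(const void *a, const void *b)
{
    const struct core *x = a, *y = b;
    return y->bogomips - x->bogomips;           /* descending: fastest first */
}

static int by_prio(const void *a, const void *b)
{
    const struct task *x = a, *y = b;
    return x->prio - y->prio;                   /* ascending: best prio first */
}

int main(void)
{
    struct core cores[] = { {0, 2000}, {1, 6000}, {2, 6000}, {3, 2000} };
    struct task tasks[] = { {"video", 5}, {"seti", 39}, {"make", 20}, {"pl", 39} };
    int nc = 4, nt = 4, i;

    qsort(cores, nc, sizeof *cores, by_speed);  /* fastest iron first  */
    qsort(tasks, nt, sizeof *tasks, by_prio);   /* best priority first */

    for (i = 0; i < nt && i < nc; i++)          /* one runnable task per core */
        printf("%-5s (prio %2d) -> cpu %d (%d bogomips)\n",
               tasks[i].name, tasks[i].prio, cores[i].id, cores[i].bogomips);
    return 0;
}

The affinity part mentioned above would sit on top of this, biasing the
assignment so a task keeps its previous core for a while rather than
strictly following the sort.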

-- mrr

Terje Mathisen

unread,
Jan 5, 2009, 3:35:24 PM1/5/09
to

The problem comes back when the cpus have disjoint capabilities, i.e. on
a hybrid Core2/LRB cpu only the LRB cores will handle 16-wide vectors,
while the Core2 can do SSE4 etc.

If the software is adaptive (using CPUID to determine current
capabilities), it might run on both kinds, and since it uses 100% of the
cpu time available on the Core2 core, it would not be moved to an LRB
core, even though it could run 2-4 times faster there.

This means that LRB-specific code needs a way to indicate that it would
really like to be scheduled on such a core!

The easy solution would be a try block containing a single LRB opcode,
and then fall back to SSE code after an exception (or two?) :-)

A more likely alternative is for CPUID on a hybrid chip to always return
the sum of all supported features plus another mask which indicates what
the current core can do.
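
In code, adaptive software on such a hybrid chip might probe the two
masks like this; both "CPUID" helpers are stubs here, since no such
dual-mask leaf actually exists today:

#include <stdint.h>
#include <stdio.h>

/* hypothetical feature bits */
enum { FEAT_SSE4 = 1u << 0, FEAT_VEC16 = 1u << 1 };

/* stand-ins for the two masks described above */
static uint32_t chip_features(void) { return FEAT_SSE4 | FEAT_VEC16; } /* union over all cores */
static uint32_t core_features(void) { return FEAT_SSE4; }              /* this core only       */

int main(void)
{
    if (chip_features() & FEAT_VEC16) {
        if (core_features() & FEAT_VEC16)
            puts("run the 16-wide kernel here");
        else
            puts("ask the OS to move us to a vec16-capable core");
    } else {
        puts("fall back to the SSE path everywhere");
    }
    return 0;
}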

Gavin Scott

unread,
Jan 5, 2009, 4:30:48 PM1/5/09
to
Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> If the software is adaptive (using CPUID to determine current
> capabilities), it might run on both kinds,

This seems difficult to do if you can switch cores as a result of a
task switch at any arbitrary point.

> This means that LRB-specific code needs a way to indicate that it would
> really like to be scheduled on such a core!

> The easy solution would be a try block containing a single LRB opcode,
> and then fall back to SSE code after an exception (or two?) :-)

Well, what about just having the OS "unimplemented" trap handler mark
the thread as needing to be launched on a core that includes that
functionality? This adaptive method would probably be more accurate
than trusting the process itself to know what it needs.

G.

Morten Reistad

unread,
Jan 5, 2009, 5:56:20 PM1/5/09
to
In article <GdWdndAPO7-Q8v_U...@giganews.com>,

Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>Morten Reistad wrote:
>> In article <E1r7l.250857$Mh5.2...@bgtnsc04-news.ops.worldnet.att.net>,
>> Stephen Fuld <S.F...@PleaseRemove.att.net> wrote:
>>> Jean-Marc Bourguet wrote:
>>>> Stephen Fuld <S.F...@PleaseRemove.att.net> writes:
[snip]

It seems we need a couple of system calls here, and a restriction on the
scheduler, but otherwise all the data is in the Linux kernel already.
(They may be there already; I haven't read the kernel lists for a few months.)

If you are to schedule on a subset of the available resources you just
need a mask; the scheduling problem remains the same, or simpler.
You just schedule on the union of the task and processor requirements.

You may get into problems resolving a network if there are interlocks
among the capabilities; but for simple unions it should be straightforward.

Note that threads within a process can have different attributes, and
therefore end up on different processors. You may therefore want to
track capability requirements per thread.

All of this went into Linux 2.6 sometime before 2.6.9 with the clock
adjustment code. The kernel already schedules intelligently between
different speed and e.g. hyperthreading and non-hyperthreading processors.

Linux does use bogomips as the indicator of cpu speed, so the processors
shouldn't be so diverse that there is a large divergence in bogomips
vs real speed.

It should be sufficient to have a "get capabilities" and a "require
capabilities"; you do the get, see if the LRB, SSE or other features your
code really needs are there, and if so you set "require capabilities" for
the thread. After that it does not get scheduled on a non-capable cpu.
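
In user-space terms the pair of calls might be used like this; both
functions are hypothetical stand-ins (stubbed out so the sketch compiles),
not anything in the stock kernel interface:

#include <stdint.h>
#include <stdio.h>

#define CAP_VEC16 (1u << 1)                /* hypothetical capability bit */

/* stubs for the two proposed calls; a real implementation would be
   kernel syscalls, these just make the sketch runnable */
static uint32_t get_capabilities(void)           { return CAP_VEC16; }
static int      require_capabilities(uint32_t m) { (void)m; return 0; }

int main(void)
{
    if (get_capabilities() & CAP_VEC16) {
        require_capabilities(CAP_VEC16);   /* from now on, only vec16-capable cores */
        puts("using the 16-wide code path");
    } else {
        puts("using the scalar/SSE fallback");
    }
    return 0;
}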

-- mrr


Wes Felter

unread,
Jan 5, 2009, 7:28:55 PM1/5/09
to
Stephen Fuld wrote:

> I've been thinking about this a little. If the number of "threads"
> (i.e. units of work requesting CPU time whether they be single thread
> processes or threads of a multi-thread process) is =< the number of fast
> processors, then it seems trivial. But if the number of requesters is
> greater than the number of fast processors, is there a general algorithm
> for the OS to decide (without any information from the requesters
> themselves) which ones are given time on the fast processors? ISTM this
> could get complex very quickly and I don't see an easy solution.

There has been some previous work on this topic:
http://portal.acm.org/citation.cfm?id=1028176.1006707

Wes Felter

EricP

unread,
Jan 6, 2009, 12:08:06 AM1/6/09
to

Yeah, this looks to me like just a variation on the
existing thread-processor affinity mechanism -
each thread has a bit mask of allowed processors.
This is implemented by each processor scheduler scanning
the ready queue for threads with its bit set.

Other problems include deciding when to move a thread as
its cache on its last processor may or may not have value.
So you want it to be 'sticky' but not too sticky.
And you have to watch out for priority inversion
that can result from the stickiness.

I was thinking the cache was like capacitor charge and discharge
so there is a 'soft affinity' that makes a thread stick
to a processor for a certain time after it runs.
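
A sketch of that capacitor-style decay; the constants are arbitrary and a
real scheduler would track this per run queue entry rather than in a toy
loop:

#include <stdio.h>

/* 'charge' of a thread's affinity to its last CPU, in arbitrary units.
   It is topped up while the thread runs there and decays while it sits
   in the ready queue, like a discharging capacitor. */
struct affinity { double charge; };

static void ran_on_cpu(struct affinity *a)          { a->charge = 1.0; }
static void tick_waiting(struct affinity *a)        { a->charge *= 0.8; } /* decay */
static int  worth_keeping(const struct affinity *a) { return a->charge > 0.3; }

int main(void)
{
    struct affinity a;
    int ticks;

    ran_on_cpu(&a);
    for (ticks = 0; worth_keeping(&a); ticks++)
        tick_waiting(&a);

    /* after this many queue ticks the cache is assumed cold and the
       thread may migrate to any (possibly faster) idle CPU */
    printf("sticky for %d ticks\n", ticks);
    return 0;
}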

But you can wind up with idle cpus while threads are
in the ready queue due to stickiness (I believe WNT had
this problem at one point. I don't know if they fixed it.
Probably not.)

Also if the thread is running, as opposed to ready,
then there is the added cost of an IPI (inter processor
interrupt) to force the thread off its processor
so it can move to a faster one.
My intuition says that forcing a high priority thread
off a slow cpu so it can run on a fast cpu is unnecessary.

That simplifies the scheduler design but can result
in "priority inversion" with high priority threads on
low speed cpus. However forcing a thread off a cpu can
still happen if a higher priority thread is pending and a
lower priority thread is running, but just not for the sole
purpose of moving it to a faster cpu.

This means the processor scheduler only has to deal with two
considerations, itself and the lowest priority running thread.
But it does not have to compute the optimal match of running
threads to cpus.

Eric

Stephen Fuld

unread,
Jan 6, 2009, 12:40:18 AM1/6/09
to

Probably. But at least on my current Windows system, I just checked and
the overwhelming majority of the tasks are at priority "normal". Note
that this includes most of the Windows services as well as the
"background" tasks that only get control once in a while to check for
updates, etc. So that mechanism seems to rely on the programmer putting
something in his program and understanding enough to be able to do it
intelligently. I am, at best, dubious that most software that people
either buy from the local store or Amazon, or download from some web
site will do this well enough for the system to rely on it.

EricP

unread,
Jan 6, 2009, 12:44:47 AM1/6/09
to
EricP wrote:
>
> This means the processor scheduler only has to deal with two
> considerations, itself and the lowest priority running thread.
> But it does not have to compute the optimal match of running
> threads to cpus.

by this I mean that the scheduler tables are global and
guarded by a spinlock. This makes it a potential critical
path and you want whatever this does to be simple and fast.

The algorithm must also not deadlock, and this means that
each processor must make its own decision for itself.
It cannot make decisions for others and send an IPI
because as soon as the lock is released a third processor
could lock the table and change it.

So all a processor can do is lock the table, decide
whether the highest pending priority thread is higher
than its own current thread, and if so switch.

Then it decides whether to send an IPI to the processor
running the lowest priority thread. This will interrupt it
so it can lock the table and decide on its own
whether it should switch its thread.

This allows the schedule to propagate through the processors
in a system yet not deadlock, because each makes its own
simple decision.
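
Roughly, each CPU's decision step might look like this (a single-threaded
sketch; the mutex stands in for the scheduler spinlock and send_ipi() is
just a stub):

#include <pthread.h>
#include <stdio.h>

#define NCPU 4

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
static int running_prio[NCPU] = { 10, 20, 30, 40 }; /* per-cpu thread prio (lower = better) */
static int top_ready_prio = 5;                      /* best priority waiting in the ready queue */

static void send_ipi(int cpu)                       /* stub for the inter-processor interrupt */
{
    printf("IPI -> cpu %d: reschedule yourself\n", cpu);
}

/* Decide only for ourselves, then at most nudge the CPU running the
   lowest-priority thread so it can make its own decision. */
static void reschedule_self(int me)
{
    int victim = me, victim_prio, ready_prio, i;

    pthread_mutex_lock(&sched_lock);
    if (top_ready_prio < running_prio[me]) {        /* a better thread is pending: switch */
        int old = running_prio[me];
        running_prio[me] = top_ready_prio;
        top_ready_prio = old;                       /* our old thread goes back to ready */
    }
    for (i = 0; i < NCPU; i++)                      /* find the worst-priority runner */
        if (running_prio[i] > running_prio[victim])
            victim = i;
    victim_prio = running_prio[victim];
    ready_prio  = top_ready_prio;
    pthread_mutex_unlock(&sched_lock);

    if (victim != me && ready_prio < victim_prio)
        send_ipi(victim);                           /* that CPU runs this code for itself */
}

int main(void)
{
    reschedule_self(0);
    return 0;
}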

Eric

Stephen Fuld

unread,
Jan 6, 2009, 12:46:47 AM1/6/09
to

I suppose that the CPU vendor could make the fast processor a superset
of the slower one. Remember, the faster CPU doesn't have to execute the
"special" (in your example 16-wide vectors) instructions faster than the
slower CPU executes them. It just has to avoid executing them so much
more slowly than the slow CPU that the task no longer gets done in about
the same time.

Stephen Fuld

unread,
Jan 6, 2009, 12:49:12 AM1/6/09
to


Thanks for the reference. But I let my ACM membership lapse when I
retired so I can't get the full text of the article. :-(

EricP

unread,
Jan 6, 2009, 12:53:19 AM1/6/09
to

Terje Mathisen

unread,
Jan 6, 2009, 6:33:06 AM1/6/09
to
Gavin Scott wrote:
> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>> The easy solution would be a try block containing a single LRB opcode,
>> and then fall back to SSE code after an exception (or two?) :-)
>
> Well, what about just having the OS "unimplemented" trap handler mark
> the thread as needing to be launched on a core that includes that
> functionality? This adaptive method would probably be more accurate
> than trusting the process itself to know what it needs.

Sure, that is what I expect the first Linux LRB kernel to do, and what I
suggested in my initial post about this subject.

Terje Mathisen

unread,
Jan 6, 2009, 6:39:08 AM1/6/09
to

For LRB this is pretty much impossible:

The Vec16 processor is so different from the rest of x86 (special
texture unpacking/swizzling/distribution inside each load,
three-operand instructions, 16-wide fused mul-acc) that either you
implement (nearly) all of it in hardware, or it will take up to an order
of magnitude longer to finish.

Ken Hagan

unread,
Jan 6, 2009, 7:33:58 AM1/6/09
to
On Mon, 05 Jan 2009 21:30:48 -0000, Gavin Scott <ga...@allegro.com> wrote:

> Well, what about just having the OS "unimplemented" trap handler mark
> the thread as needing to be launched on a core that includes that
> functionality? This adaptive method would probably be more accurate
> than trusting the process itself to know what it needs.

Definitely. With dynamic linking, the author probably can't know at
compile-time what will be required. Even at run-time, these "requirements"
need to "age" in some way because the requirements of long-lived threads
may change over time. (I may use a fancy feature once at thread startup
and never again thereafter.)

Morten Reistad

unread,
Jan 6, 2009, 9:27:22 AM1/6/09
to
In article <43597$4962e76f$45c49ea8$16...@TEKSAVVY.COM>,

EricP <ThatWould...@thevillage.com> wrote:
>Gavin Scott wrote:
>> Terje Mathisen <terje.m...@hda.hydro.com> wrote:

>>
>> Well, what about just having the OS "unimplemented" trap handler mark
>> the thread as needing to be launched on a core that includes that
>> functionality? This adaptive method would probably be more accurate
>> than trusting the process itself to know what it needs.
>
>Yeah, this look to me as just a variation on the
>existing thread-processor affinity mechanism -
>each thread has a bit mask of allowed processors.
>This is implemented by each processor scheduler scanning
>the ready queue for threads with the its bit set.
>
>Other problems include deciding when to move a thread as
>its cache on its last processor may or may not have value.
>So you want it to be 'sticky' but not too sticky.
>And you have to watch out for priority inversion
>that can result from the stickiness.

You want to avoid processor flapping. If it is time to
move, just do it. Just don't move again too soon. If such
conditions exist with the run queues, the whole process
balancing is out of whack and needs to be dampened anyway.

>I was thinking the cache was like capacitor charge and discharge
>so there is a 'soft affinity' that makes a thread stick
>to a processor for a certain time after it runs.
>
>But you can wind up with idle cpus while threads are
>in the ready queue due to stickiness (I believe WNT had
>this problem at one point. I don't know if they fixed it.
>Probably not.)
>
>Also if the thread is running, as opposed to ready,
>then there is the added cost of an IPI (inter processor
>interrupt) to force the thread off its processor
>so it can move to a faster one.
>My intuition says that forcing a high priority thread
>off a slow cpu so it can run on a fast cpu is unnecessary.

I would tend to agree. Processor affinities can be handled
at quantum end or at I/O, and be done by the processor in question
handing off the code.

>That simplifies the scheduler design but can result
>in "priority inversion" with high priority threads on
>low speed cpus. However forcing a thread off a cpu can
>still happen if a higher priority thread is pending and a
>lower priority thread is running, but just not for the sole
>purpose of moving it to a faster cpu.

In some scenarios it makes sense for each high priority
job to have its own, slower processor all by itself rather
than fight with the rest of the processes for timeslices
on a single or dual fast machine.

On the media coding benchmarks I have done, system throughput
improved by giving the worker processes n-2 processors, and
running the rest -- OS, file systems, shells, process control,
etc. -- on the remaining ones (12+ processors).

>This means the processor scheduler only has to deal with two
>considerations, itself and the lowest priority running thread.
>But it does not have to compute the optimal match of running
>threads to cpus.

I get a violent deja-vu feeling from this discussion. This
was beaten to death on the kernel lists when the cpu throttle
control for MP systems hit the mainstream a few years ago.

That problem is more complicated: how many processors to run
at full speed vs. throttled, and whether to migrate the process
vs. unthrottling the processor it is running on.

I suspect the real world will give us asymmetric systems that also
have cpu throttling and shutdown, so we have to take that into
account as well.

-- mrr


Anton Ertl

unread,
Jan 6, 2009, 1:12:34 PM1/6/09
to
Morten Reistad <fi...@last.name> writes:
>All of this went into Linux 2.6 sometime before 2.6.9 with the clock
>adjustment code. The kernel already schedules intelligently between
>different speed and e.g. hyperthreading and non-hyperthreading processors.

I find what the Linux scheduler does less than intelligent even on
homogeneous-performance systems. E.g., with the 2.6.18 kernel:

Cpu(s): 50.0%us, 0.0%sy, 37.5%ni, 12.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25769 ulrich 39 19 179m 3508 612 R 100 0.0 31357:51 pl
25169 cacao 25 0 98.1m 13m 5072 R 100 0.1 2359:02 java
11519 cacao 25 0 99424 13m 5072 R 100 0.1 2322:47 java
32324 cacao 25 0 99424 13m 5072 R 100 0.1 923:50.96 java
19374 cacao 25 0 99424 13m 5072 R 100 0.1 888:01.52 java
25755 ulrich 39 19 179m 3648 604 R 33 0.0 31143:02 pl
25758 ulrich 39 19 179m 4164 652 R 33 0.0 31428:22 pl
25761 ulrich 39 19 179m 3100 612 R 33 0.0 31144:48 pl
25772 ulrich 39 19 179m 9884 604 R 25 0.0 31304:01 pl
25775 ulrich 39 19 179m 3568 648 R 25 0.0 31812:42 pl
25778 ulrich 39 19 179m 4944 604 R 25 0.0 31491:55 pl
25764 ulrich 39 19 179m 4624 620 R 25 0.0 31418:02 pl

Note the 12.5% idle time even though there are 12 (single-threaded)
processes around (for 8 cores) that consume all the CPU time you give
them. So one of the cores is not used.

The 2.6.25 kernel seems better at utilization, but worse at observing
priorities:

Cpu(s): 12.3%us, 0.1%sy, 87.6%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10405 ulrich 39 19 198m 19m 1908 R 100 0.2 71726:13 pl
10407 ulrich 39 19 188m 7856 1908 R 100 0.1 71643:27 pl
10403 ulrich 39 19 188m 8508 1888 R 100 0.1 71707:13 pl
10396 ulrich 39 19 195m 14m 1908 R 51 0.2 71397:40 pl
30422 anton 20 0 3804 488 400 R 49 0.0 0:01.82 yes

This is a 4-core machine with 5 processes that consume all the CPU you
give them, four of them nice, and one of them regular. For quite some
time after the regular process is started, it does not get a full
core. Ok, the scheduler might want to avoid migrating a nice process
to another core, but even if the regular process has to share a core
with a nice process, it should get most (say 95%) of that core right
away. Eventually the regular process gets 100% of one CPU, but in
this case this took more than 1 minute. So having these nice
processes running really slows plain tasks down a lot, contrary to the
usual expectations.

Bernd Paysan

unread,
Jan 6, 2009, 4:07:48 PM1/6/09
to
Anton Ertl wrote:
> The 2.6.25 kernel seems better at utilization, but worse at observing
> priorities:
>
> Cpu(s): 12.3%us, 0.1%sy, 87.6%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
> 0.0%st
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 10405 ulrich 39 19 198m 19m 1908 R 100 0.2 71726:13 pl
> 10407 ulrich 39 19 188m 7856 1908 R 100 0.1 71643:27 pl
> 10403 ulrich 39 19 188m 8508 1888 R 100 0.1 71707:13 pl
> 10396 ulrich 39 19 195m 14m 1908 R 51 0.2 71397:40 pl
> 30422 anton 20 0 3804 488 400 R 49 0.0 0:01.82 yes

Looks better on 2.6.27. 4 cores (Phenom), yes >/dev/null used to load them:

13436 bernd 20 0 5072 648 544 R 90 0.0 0:11.42 yes
13434 bernd 20 0 5072 652 544 R 90 0.0 0:13.94 yes
13432 bernd 30 10 5072 648 544 R 83 0.0 0:16.92 yes
13430 bernd 30 10 5072 648 544 R 49 0.0 0:11.98 yes
13428 bernd 30 10 5072 648 544 R 43 0.0 0:11.00 yes
13426 bernd 30 10 5072 648 544 R 41 0.0 0:09.98 yes

The result is still not what I would think it should look like (two unniced
yes on one core each, the four niced ones sharing the two remaining cores).
I've also tried with nice -19,

13468 bernd 39 19 5072 648 544 R 99 0.0 0:12.84 yes
13474 bernd 20 0 5072 648 544 R 99 0.0 0:15.66 yes
13476 bernd 20 0 5072 648 544 R 99 0.0 0:15.36 yes
13472 bernd 39 19 5072 648 544 R 98 0.0 0:12.84 yes
13466 bernd 39 19 5072 648 544 R 1 0.0 0:15.40 yes
13470 bernd 39 19 5072 652 544 R 1 0.0 0:10.16 yes

which gives still a stunningly strange result. Second try, starting them all
at the same time:

13487 bernd 20 0 5072 648 544 R 100 0.0 0:28.52 yes
13488 bernd 20 0 5072 648 544 R 100 0.0 0:28.10 yes
13485 bernd 39 19 5072 648 544 R 69 0.0 0:14.42 yes
13483 bernd 39 19 5072 648 544 R 51 0.0 0:17.54 yes
13486 bernd 39 19 5072 648 544 R 48 0.0 0:13.80 yes
13484 bernd 39 19 5072 648 544 R 29 0.0 0:11.70 yes

Looks slightly more reasonable, but still not what I would expect from
a "completely fair scheduler". The four niced yes processes are equal
and were started at the same time, so they should be given an equal
share of the available cores (i.e. the total run times should all be
around 14s).

IMHO, a completely fair scheduler algorithm should satisfy the following
conditions:

Calculate the amount of processing time granted for each process under
ideal assumptions (one core). Now on an n-core machine, if the process
with the highest allowed time gets more than 100/n% of the available
time, schedule it on one core and don't use that core for anything else,
take both core and process off the list (a process can't take more than
100% of one core), and recalculate.
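
A direct transcription of that rule, with example weights standing in
for the nice-derived shares (two regular hogs and four heavily niced
ones, as in the runs above):

#include <stdio.h>

/* Repeatedly give a whole core to any process whose fair share of the
   remaining machine exceeds one core, then recompute. */
int main(void)
{
    double weight[] = { 1.0, 1.0, 0.1, 0.1, 0.1, 0.1 }; /* 2 regular, 4 niced */
    int    n = 6, cores = 4, assigned[6] = { 0 }, i, again = 1;

    while (again && cores > 0) {
        double total = 0.0;
        again = 0;
        for (i = 0; i < n; i++) if (!assigned[i]) total += weight[i];
        for (i = 0; i < n; i++) {
            if (assigned[i]) continue;
            if (weight[i] / total * cores >= 1.0) {   /* deserves a full core */
                printf("process %d gets a dedicated core\n", i);
                assigned[i] = 1;
                cores--;                              /* remove core and process */
                again = 1;                            /* and recalculate         */
                break;
            }
        }
    }
    printf("%d core(s) left to time-share among the rest\n", cores);
    return 0;
}

With these weights the two regular processes each end up alone on a core
and the four niced ones share the remaining two, which is the outcome
argued for above.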

Moving processes from one core to the other should be avoided in the
short term, but in order to be fair, should be allowed over longer terms.
How long does migrating a process take? E.g. a Phenom has 10 GB/s
bandwidth from L3 to L2. So with a 512 kB L2 cache, you can move a
process from one core to the other in 50µs or less. I.e. not really
worth bothering about on time slice boundaries, only worth bothering
about if the process sleeps for a very short period due to IO. So a
sleeping process should have a declining core affinity, which should
prevent niced processes from starting on that core until the affinity
has declined below the nice level. If another process becomes runnable,
find the core with the least affinity from its occupant -- running
niced processes should have less affinity than a briefly sleeping normal
process.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Morten Reistad

unread,
Jan 6, 2009, 3:18:36 PM1/6/09
to
In article <2009Jan...@mips.complang.tuwien.ac.at>,

There are lots of parameters for the schedulers in Linux. Have you tried
tuning for faster responsiveness?

See /usr/src/linux-source-`uname -r`/Documentation/sched*.txt

assuming you have sources installed.

-- mrr

Anton Ertl

unread,
Jan 6, 2009, 4:35:45 PM1/6/09
to
Bernd Paysan <bernd....@gmx.de> writes:
[scheduler on Linux 2.6.25 not great with a mix of nice and regular processes]
>Looks better on 2.6.27.

Yes; just tested on my Core 2 Duo home machine with 2 nice and one
regular process. However, looking at the load indicators of gkrellm,
both cores get varying amounts of non-nice loads, so the regular
process hops between cores all the time (and the nice ones probably,
too). Ok, on a Core 2 Duo processor affinity does not really help
(only the L1 caches are per-core), but I wonder if it's any different
on a CPU where it counts.

Anton Ertl

unread,
Jan 6, 2009, 4:46:41 PM1/6/09
to
Morten Reistad <fi...@last.name> writes:
>There are lots of parameters for the schedulers in Linux. Have you tried
>tuning for faster responsiveness ?

No. I expect the scheduler to work decently by default.


Bernd Paysan

unread,
Jan 6, 2009, 7:36:21 PM1/6/09
to
Andi Kleen wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
>>
>> Looks slightly more reasonable, but still not what I would expect from
>> a "completely fair scheduler".
>

> Do you really believe all advertisement slogans? @)

Well, the CFS was a replacement for the staircase scheduler, which did not
advertise fairness in its name. So I expect the CFS to be indeed completely
fair by design (i.e. schedule by fairness instead of some other approach).

Apparently, it defines "fairness" differently from how I define it: It takes
the process that waited longest out of the queue. Niceness isn't taken into
account there; it is only taken into account when inserting processes into
the queue. I very much doubt that this can lead to any consistent picture
when other processes are allowed to run occasionally, too.

Things are a lot fairer when the running processes are all of the same nice
level, even when top is running. This is a strong indication that the way
CFS deals with different niceness is really in conflict with fairness.

>> The four niced yes processes are equal, they
>> have been started at the same time, they should be given an equal share
>> of the available cores (i.e. at least the total run time should be all
>> 14s).
>

> Anyways perfect fairness is hard. One thing that especially affects it
> is measurement of time. Unfortunately at least in the x86 world
> that can be sometimes hard to due to various quirks, mostly related
> to power management (my impression is often that accurate time keeping
> and power management are natural enemies at the hardware level, constantly
> fighting with each other @] Ok I exaggerate, but it sometimes looks
> this way to an outsider).

I don't understand why (I design hardware). Hit the designers of these CPUs
over the head with a big cluebat. ALL CPUs I know of have a common timing
source, e.g. hypertransport clock or FSB clock. All they need is to count
those ticks. If you turn off such a clock due to power-saving, it will be
only when all processes sleep - and then the inaccuracy in accounting due
to using a less accurate clock source does not matter (the hardware I'm
designing changes clocks frequently to save power, but the common counter
for the timer is incremented by larger steps so it keeps the same
resolution, even though accuracy is lower in power-saving modes. The power
consumption of my CPU is 6 orders of magnitude lower than those we discuss
here, and it still keeps a quite reasonable timing source).
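
The counting trick described above is simple enough to sketch; the tick
rates here are invented (a 100 MHz source with 10 ns units, dropping to
4 MHz in power-save):

#include <stdint.h>
#include <stdio.h>

/* one global time base; only the step size changes with the power mode,
   so readers always see the same resolution */
static uint64_t timebase_ticks;

static void fast_clock_tick(void) { timebase_ticks += 1;  }  /* 100 MHz source: 10 ns/tick  */
static void slow_clock_tick(void) { timebase_ticks += 25; }  /*   4 MHz source: 250 ns/tick */

int main(void)
{
    int i;
    for (i = 0; i < 100; i++) fast_clock_tick();   /* 1 us at full speed  */
    for (i = 0; i < 4;   i++) slow_clock_tick();   /* 1 us in power-save  */
    printf("elapsed: %llu units of 10 ns\n",
           (unsigned long long)timebase_ticks);
    return 0;
}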

> When there is no reasonable high performance high resolution timer the
> scheduler falls back to a lower resolution timer which impacts its
> results.

Can't be the case on a Phenom, dmesg tells me:

Switched to high resolution mode on CPU 0
Switched to high resolution mode on CPU 2
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 3

> The other is that running top will disturb the balancing significantly.

Well, it takes 1% of the load or so; it should not be scheduled on
the cores with the non-niced cycle-burners. And remember: The scheduler
advertises itself as completely fair. It should take that 1% off the four
niced processes, exactly .25% from each of them ;-). Starting top later
and just looking at the run time also confirms that it is not top alone
which causes the disturbance (all processes started at the same time):

16830 bernd 20 0 5072 648 544 R 101 0.0 0:16.26 yes
16829 bernd 20 0 5072 648 544 R 99 0.0 0:16.32 yes
16827 bernd 39 19 5072 648 544 R 61 0.0 0:07.70 yes
16826 bernd 39 19 5072 648 544 R 55 0.0 0:12.82 yes
16825 bernd 39 19 5072 648 544 R 45 0.0 0:04.90 yes
16828 bernd 39 19 5072 648 544 R 38 0.0 0:07.08 yes

Nicing just by -3 doesn't help; it's still quite imbalanced:

16887 bernd 23 3 5072 648 544 R 91 0.0 0:15.72 yes
16892 bernd 20 0 5072 648 544 R 75 0.0 0:14.34 yes
16891 bernd 20 0 5072 652 544 R 67 0.0 0:12.06 yes
16890 bernd 23 3 5072 648 544 R 65 0.0 0:12.46 yes
16888 bernd 23 3 5072 648 544 R 50 0.0 0:09.64 yes
16889 bernd 23 3 5072 648 544 R 50 0.0 0:09.16 yes

Here one of the nice -3 yes processes got more CPU time than both of the
non-niced ones.

Stephen Fuld

unread,
Jan 6, 2009, 8:21:12 PM1/6/09
to

Thanks Eric. U should have thought of that :-(

I downloaded it and am going through it. In any event, it seems, from
the myriad of posts showing problems, that even the simpler problem of
homogeneous processors hasn't been solved well. Again :-(

Stephen Fuld

unread,
Jan 6, 2009, 10:51:53 PM1/6/09
to
Stephen Fuld wrote:


> Thanks Eric. U should have thought of that :-(

Sorry, of course that should have been *I* should have thought about that.

Anton Ertl

unread,
Jan 7, 2009, 9:44:38 AM1/7/09
to
Andi Kleen <fre...@alancoxonachip.com> writes:

>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>
>> Bernd Paysan <bernd....@gmx.de> writes:
>> [scheduler on Linux 2.6.25 not great with a mix of nice and regular processes]
>>>Looks better on 2.6.27.
>>
>> Yes; just tested on my Core 2 Duo home machine with 2 nice and one
>> regular process. However, looking at the load indicators of gkrellm,
>
>This might be not obvious, but running any active monitoring tools like
>top or gkrellm tends to disturb the process balance situation significantly
>because they also use a lot of CPU time and need to be scheduled too.

"A lot" is a little overstated. E.g., I'm just running top on a
2.66GHz Core 2 Duo, and I see (in top):

Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

I.e., top does not even show up in the statistics.

Anyway, of course top, gkrellm, and various other processes or kernel
threads that consume a little CPU now and then will affect the
scheduling. Still, the effects I showed should not happen in a decent
scheduler even if such little-CPU-usage processes are around.

E.g., in the example above (two nice and one regular CPU-hogging
process), if the scheduler tries to keep CPU affinity, I expect the
following: the regular process gets one core, and stays there all the
time. The two nice processes get the other core, and stay there all
the time; and if one of the small CPU users (like top, gkrellm, X, or
xmms) wakes up, I expect that to be scheduled on the same core as the
nice processes.

After all, a nice process (certainly at nice level 19, which I used)
should make way to regular processes in the usual case, so with only
1-2 regular processes being awake at the same time in the usual case,
and with enough CPU time left for nice processes, there's no need to
deschedule any runnable regular process.

nm...@cam.ac.uk

unread,
Jan 7, 2009, 12:34:02 PM1/7/09
to
In article <51ad36-...@vimes.paysan.nom>,
Bernd Paysan <bernd....@gmx.de> wrote:

>Andi Kleen wrote:
>
>> Anyways perfect fairness is hard. One thing that especially affects it
>> is measurement of time. Unfortunately at least in the x86 world
>> that can be sometimes hard to due to various quirks, mostly related
>> to power management (my impression is often that accurate time keeping
>> and power management are natural enemies at the hardware level, constantly
>> fighting with each other @] Ok I exaggerate, but it sometimes looks
>> this way to an outsider).
>
>I don't understand why (I design hardware). Hit the designers of these CPUs
>over the head with a big cluebat. ALL CPUs I know of have a common timing
>source, e.g. hypertransport clock or FSB clock. All they need is to count
>those ticks. If you turn off such a clock due to power-saving, it will be
>only when all processes sleep - and then the inaccuracy in accounting due
>to using a less accurate clock source does not matter (the hardware I'm
>designing changes clocks frequently to save power, but the common counter
>for the timer is incremented by larger steps so it keeps the same
>resolution, even though accuracy is lower in power-saving modes. The power
>consumption of my CPU is 6 orders of magnitude lower than those we discuss
>here, and it still keeps a quite reasonable timing source).

I understand why :-( It's because it needed to be done that way on the
very early x86 CPUs, and it hasn't been changed since. The fact that they
could improve the accuracy, reliability and functionality by a vast
factor (arguably hundreds to thousands of times) with ten times less
logic and a hundred times less software is ignored by the people who
make decisions.

Their heads are impervious to mere cluebats, unfortunately, but it is
important to realise that it is the OVERALL design that is broken and
not the design of individual components.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Jan 7, 2009, 12:44:45 PM1/7/09
to
>Andi Kleen <fre...@alancoxonachip.com> writes:
>>
>>This might be not obvious, but running any active monitoring tools like
>>top or gkrellm tends to disturb the process balance situation significantly
>>because they also use a lot of CPU time and need to be scheduled too.
>
>"A lot" is a little overstated. ...

Indeed, though the effects may be more drastic than top indicates.

>Anyway, of course top, gkrellm, and various other processes or kernel
>threads that consume a little CPU now and then will affect the
>scheduling. Still, the effects I showed should not happen in a decent
>scheduler even if such little-CPU-usage processes are around.

Yes and no. This is an insoluble problem, and any scheduler will
misbehave on at least some reasonable and common workloads.

>E.g., in the example above (two nice and one regular CPU-hogging
>process), if the scheduler tries to keep CPU affinity, I expect the
>following: the regular process gets one core, and stays there all the
>time. The two nice processes get the other core, and stay there all
>the time; and if one of the small CPU users (like top, gkrellm, X, or
>xmms) wakes up, I expect that to be scheduled on the same core as the
>nice processes.

And what about kernel threads, including those actually used by the
kernel scheduler?

>After all, a nice process (certainly at nice level 19, which I used)
>should make way to regular processes in the usual case, so with only
>1-2 regular processes being awake at the same time in the usual case,
>and with enough CPU time left for nice processes, there's no need to
>deschedule any runnable regular process.

Nice has been broken for at least 20 years, since Berkeley redesigned
the old one, and every system that I have seen since has followed the
same path. There were good reasons for that - the old scheduler was
not flexible enough for GUIs, tasks with both kernel and user code,
and so on - but it is a great pity that nice wasn't scrapped when it
was broken.

The fundamental issue here is that you can't prioritise the subthreads
of an application over the system as a whole and expect it to work at
all well. You MUST use a proper hierarchy, as on mainframes, and that
hasn't been seen since VMS.


Regards,
Nick Maclaren.
