Intel details future Larrabee graphics chip

NV55

Aug 4, 2008, 6:46:40 PM

http://news.cnet.com/8301-13924_3-10005391-64.html

Intel has disclosed details on a chip that will compete directly with
Nvidia and ATI and may take it into uncharted technological and
market-segment waters.

Larrabee will be a stand-alone chip, meaning it will be very different
than the low-end--but widely used--integrated graphics that Intel now
offers as part of the silicon that accompanies its processors. And
Larrabee will be based on the universal Intel x86 architecture.

The first Larrabee product will be "targeted at the personal computer
market," according to Intel. This means the PC gaming market--putting
Nvidia and AMD-ATI directly into Intel's sights. Nvidia and AMD-ATI
currently dominate the market for "discrete" or stand-alone graphics
processing units.

http://i.i.com.com/cnwk.1d/i/bto/20080803/intel-larrabee-2-small.jpg

Larry Seiler (standing, middle), a senior Intel engineer, and Stephen
Junkins (sitting, right), an Intel graphics software architect, speak
at a briefing on the Larrabee chip, due in 2009-2010.
(Credit: Brooke Crothers)

As Intel sees it, Larrabee combines the best attributes of a central
processing unit (CPU) with a graphics processor. "The thing we need is
an architecture that combines the full programmability of the CPU with
the kinds of parallelism and other special capabilities of graphics
processors. And that architecture is Larrabee," Larry Seiler, a senior
principal engineer in Intel's Visual Computing Group, said at a
briefing on Larrabee in San Francisco last week.

"It is not a GPU as many have mistakenly described it, but it can do
most graphics functions," Jon Peddie of Jon Peddie Research, said in
an article he posted Friday about Larrabee.

"It looks like a GPU and acts like a GPU but actually what it's doing
is introducing a large number of x86 cores into your PC," said Intel
spokesperson Nick Knupffer, alluding to the myriad ways Larrabee could
be used beyond just graphics processing. In addition to the PC, high-
performance computing and workstations are two potential markets that
were also mentioned.

Intel describes it in a statement as "the industry's first many-core
x86 Intel architecture." The chipmaker currently offers quad-core
processors and will offer eight-core processors based on its Nehalem
architecture, but Larrabee is expected to have dozens of cores and,
later, possibly hundreds.

The number of cores in each Larrabee chip may vary, according to
market segment. Intel showed a slide with core counts ranging from 8
to 48, claiming performance scales almost linearly as more cores are
added: that is, 16 cores will offer twice the performance of eight
cores.
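
The scaling claim can be written out as arithmetic. This is a minimal sketch, assuming perfectly linear scaling from an 8-core baseline; the slide only claimed "almost" linear, so the `efficiency` parameter is an illustrative knob and not a figure from Intel, and the function name is made up for this example.

```python
def relative_performance(cores, baseline_cores=8, efficiency=1.0):
    """Performance relative to the baseline core count.

    efficiency=1.0 models perfectly linear scaling. A value slightly
    below 1.0 is a crude stand-in for "almost linear" (an assumption),
    applied only to the cores added beyond the baseline.
    """
    if cores < baseline_cores:
        raise ValueError("cores must be at least baseline_cores")
    extra = cores - baseline_cores
    return (baseline_cores + efficiency * extra) / baseline_cores


# Core counts taken from the range shown on the slide (8 to 48).
for n in (8, 16, 24, 32, 48):
    print(n, relative_performance(n))
```

With `efficiency=1.0`, 16 cores come out at exactly twice the 8-core figure, which is the case the article describes.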

The individual cores in Larrabee are derived from the Intel Pentium
processor and "then we added 64-bit instructions and multi-threading,"
Seiler said. Each core has 256 kilobytes of level-2 cache allowing the
size of the cache to scale with the total number of cores, according
to Seiler. And application programming interfaces (APIs) such as
Microsoft's DirectX and Apple's OpenCL can be tapped. "Larrabee does
not require a special API. Larrabee will excel on standard graphics
APIs," he said. "So existing games will be able to run on Larrabee
products."

So, what is Larrabee's market potential? Today, the graphics chip
market is approaching 400 million units a year and has consolidated
into a handful of suppliers. "And of that population, two suppliers,
ATI and Nvidia, own 98 percent of the discrete GPU business,"
according to Peddie.

"And the trend line indicates a flattening to decline in the
business...However, Intel is no light-weight start up, and to enter
the market today a company has to have a major infrastructure, deep IP
(intellectual property), and marketing prowess--Intel has all that and
more," Peddie said.

http://i.i.com.com/cnwk.1d/i/bto/20080803/intel-larrabee-explanation-slide-small.jpg

Larrabee combines aspects of a CPU and GPU
(Credit: Intel)

Though more details will be provided at Siggraph 2008, some key
Larrabee features:

Larrabee programming model: supports a variety of highly parallel
applications, including those that use irregular data structures. This
enables development of graphics APIs, rapid innovation of new graphics
algorithms, and true general purpose computation on the graphics
processor with established PC software development tools.

Software-based scheduling: Larrabee features task scheduling which is
performed entirely with software, rather than in fixed function logic.
Therefore rendering pipelines and other complex software systems can
adjust their resource scheduling based on each workload's unique
computing demand.

Execution threads: Larrabee architecture supports four execution
threads per core with separate register sets per thread. This allows
the use of a simple efficient in-order pipeline, but retains many of
the latency-hiding benefits of more complex out-of-order pipelines
when running highly parallel applications.

Ring network: Larrabee uses a 1024-bit-wide, bi-directional ring
network (i.e., 512 bits in each direction) to allow agents to
communicate with each other in a low-latency manner, resulting in super
fast communication between cores.
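
One way to see what the bi-directional ring buys is to count hops. The sketch below assumes each agent is one stop on the ring and that a message travels in whichever direction is shorter; the article does not describe the routing, so both points are assumptions and the function names are illustrative.

```python
def ring_hops(src, dst, n_agents):
    """Fewest hops between two stops on a bidirectional ring."""
    clockwise = (dst - src) % n_agents
    return min(clockwise, n_agents - clockwise)


def average_hops(n_agents):
    """Mean hop count from one agent to every other agent."""
    total = sum(ring_hops(0, d, n_agents) for d in range(1, n_agents))
    return total / (n_agents - 1)


# On a 16-stop ring the farthest agent is 8 hops away, and a neighbour
# "behind" the sender is 1 hop away instead of 15.
print(ring_hops(0, 8, 16), ring_hops(0, 15, 16), average_hops(16))
```

Under these assumptions the worst case on an n-stop ring is n/2 hops, against n-1 for a ring that only moves data in one direction.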

"A key characteristic of this vector processor is a property we call
being vector complete...You can run 16 pixels in parallel, 16 vertices
in parallel, or 16 more general program invocations in parallel,"
Seiler said.

xVecticism

Aug 4, 2008, 8:52:52 PM

Here's the paper from Intel's website:
http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf

"NV55" <nvidi...@mail.com> wrote in message
news:eb967390-17aa-499c...@m45g2000hsb.googlegroups.com...

Chris M. Thomasson

Aug 4, 2008, 11:42:47 PM

"NV55" <nvidi...@mail.com> wrote in message
news:eb967390-17aa-499c...@m45g2000hsb.googlegroups.com...

> http://news.cnet.com/8301-13924_3-10005391-64.html
>
>
> Intel has disclosed details on a chip that will compete directly with
> Nvidia and ATI and may take it into uncharted technological and
> market-segment waters.
>
> Larrabee will be a stand-alone chip, meaning it will be very different
> than the low-end--but widely used--integrated graphics that Intel now
> offers as part of the silicon that accompanies its processors. And
> Larrabee will be based on the universal Intel x86 architecture.

[...]

Are they saying that programming this chip will be easier than programming a
GPU because it honors the well established x86 arch?

xVecticism

Aug 5, 2008, 12:16:55 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:kXPlk.7164$QX3....@newsfe02.iad...

> Are they saying that programming this chip will be easier than programming
> a GPU because it honors the well established x86 arch?

http://anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3367&p=8

"
To the developer, it appears as exactly what it is - an arrangement of fully
cache coherent x86 microprocessors. The first iteration of Larrabee will
hide this fact from the OS through its graphics driver, but future versions
of the chip could conceivably populate task manager just like your desktop
x86 cores do today.

You have two options for harnessing the power of Larrabee: writing standard
DirectX/OpenGL code, or writing directly to the hardware using Larrabee
C/C++, which as it turns out is standard C (you can use compilers from MS,
Intel, GCC, etc...). In a sense, this is no different than what NVIDIA
offers with its GPUs - they will run DirectX/OpenGL code, or they can also
run C-code thanks to CUDA. The difference here is that writing directly to
Larrabee gives you some additional programming flexibility thanks to the GPU
being an array of fully functional x86 CPUs. Programming for x86
architectures is a paradigm that the software community as a whole is used
to, there's no learning curve, no new hardware limitations to worry about
and no waiting on additional iterations of CUDA to enable new features. You
treat Larrabee like you treat your host CPU. "

Torben Ægidius Mogensen

Aug 5, 2008, 5:04:27 AM

"Chris M. Thomasson" <n...@spam.invalid> writes:

> "NV55" <nvidi...@mail.com> wrote in message
> news:eb967390-17aa-499c...@m45g2000hsb.googlegroups.com...
>> http://news.cnet.com/8301-13924_3-10005391-64.html
>>
>>

>> Larrabee will be a stand-alone chip, meaning it will be very different
>> than the low-end--but widely used--integrated graphics that Intel now
>> offers as part of the silicon that accompanies its processors. And
>> Larrabee will be based on the universal Intel x86 architecture.
> [...]
>
> Are they saying that programming this chip will be easier than
> programming a GPU because it honors the well established x86 arch?

They are, and it is utter fertilizer from male cattle.

Intel seems keen on convincing everyone that the x86 ISA is superior
to all others in all areas where you use processors, because it is
"well known" and "compatible". But in terms of compatibility, the
only advantage of x86 is for desktop Windows, and that is not
interesting in embedded areas or for GPUs. As for being "well known",
that only matters for assembler programmers, and there are probably
more who program ARM assembler or 8051 assembler than x86 assembler.

Torben

Wilco Dijkstra

Aug 5, 2008, 6:03:04 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message news:kXPlk.7164$QX3....@newsfe02.iad...
>

That's rubbish indeed. The cache coherency seems to be the only advantage
as other GPUs also support C. However the claimed x86 "compatibility" isn't. If
you use C the ISA doesn't matter much, and if you write assembler then there is
no compatibility as the new SIMD instructions don't exist on any current x86's
and Larrabee doesn't appear to support SSE instructions either...

It would have made far more sense to use a simpler and more streamlined
ISA which would give a significant codesize, area and power saving, if not
a performance boost. But Intel is always keen to push their inefficient ISA
where it doesn't belong...

Wilco

Skybuck Flying

Aug 5, 2008, 7:30:52 AM

As the number of cores goes up, do the watt requirements go up too?

Will we need a zillion watts of power soon?

Bye,
Skybuck.

Dirk Bruere at NeoPax

Aug 5, 2008, 8:26:10 AM

Since the ATI Radeon™ HD 4800 series has 800 cores, you work it out.

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff

Eric P.

Aug 5, 2008, 11:55:14 AM

Maybe the special sauce is in the interconnect. They said:

"Ring network: Larrabee uses a 1024 bits-wide, bi-directional ring
network (i.e., 512 bits in each direction) to allow agents to
communicate with each other in low latency manner resulting in super
fast communication between cores."

I'm guessing you could build a pretty nice cache coherency network
if you never had to go off chip, though I'm trying to imagine how
a bi-ring network (a 2 directional token ring????) fits into that.

Eric

John Larkin

Aug 5, 2008, 11:24:04 AM

On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying"
<Blood...@hotmail.com> wrote:

>As the number of cores goes up the watt requirements goes up too ?

Not necessarily, if the technology progresses and the clock rates are
kept reasonable. And one can always throttle down the CPUs that aren't
busy.

>
>Will we need a zillion watts of power soon ?
>
>Bye,
> Skybuck.
>

I saw suggestions of something like 60 cores, 240 threads in the
reasonable future.

This has got to affect OS design.

John

Wes Felter

Aug 5, 2008, 12:19:05 PM

Wilco Dijkstra wrote:
> "Chris M. Thomasson" <n...@spam.invalid> wrote in message news:kXPlk.7164$QX3....@newsfe02.iad...
>> "NV55" <nvidi...@mail.com> wrote in message

>>> Larrabee will be a stand-alone chip, meaning it will be very different

>>> than the low-end--but widely used--integrated graphics that Intel now
>>> offers as part of the silicon that accompanies its processors. And
>>> Larrabee will be based on the universal Intel x86 architecture.
>> [...]
>>
>> Are they saying that programming this chip will be easier than programming a GPU because it honors the well
>> established x86 arch?
>
> That's rubbish indeed. The cache coherency seems to be the only advantage
> as other GPU also support C.

The real advantage has been lost in the PR: Larrabee doesn't just
support C, it supports pthreads (and thus any other concurrency model
that can be built on pthreads). MIMD + cache coherence + x86 is a
significant advantage over CUDA (which I would describe as "C, but not
as we know it").

I noticed recently that Cilk++, TBB, Fortress, and X10 are all using
work-stealing rather than static partitioning. AFAIK MIMD is a
prerequisite for work-stealing, so many of the future parallel
programming languages may not be able to run on conventional GPUs at all.
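
The difference between the two approaches can be sketched with a toy model. What follows is a single-threaded simulation, not a concurrent scheduler and not the algorithm used by Cilk++, TBB, Fortress or X10; the unit-step clock, the "steal from the longest queue" rule and the function names are all assumptions made for illustration. Task costs are assumed to be positive integers.

```python
from collections import deque


def static_partition(task_costs, n_workers):
    """Finish time when tasks are dealt out round-robin up front."""
    loads = [0] * n_workers
    for i, cost in enumerate(task_costs):
        loads[i % n_workers] += cost
    return max(loads)


def work_stealing(task_costs, n_workers):
    """Finish time when idle workers steal from the longest queue."""
    queues = [deque() for _ in range(n_workers)]
    for i, cost in enumerate(task_costs):
        queues[i % n_workers].append(cost)    # same initial deal as above
    remaining = [0] * n_workers               # time left on current task
    clock = 0
    while any(remaining) or any(queues):
        for w in range(n_workers):
            if remaining[w] == 0:
                if queues[w]:
                    # Own work comes off the newest end of the deque.
                    remaining[w] = queues[w].pop()
                else:
                    victim = max(range(n_workers),
                                 key=lambda v: len(queues[v]))
                    if queues[victim]:
                        # Stolen work comes off the oldest end.
                        remaining[w] = queues[victim].popleft()
        for w in range(n_workers):
            if remaining[w] > 0:
                remaining[w] -= 1
        clock += 1
    return clock


# Alternating long and short tasks: the round-robin deal gives every
# long task to worker 0.
costs = [8, 1, 8, 1, 8, 1, 8, 1]
print(static_partition(costs, 2), work_stealing(costs, 2))
```

The point of the toy is only that each worker has to make its own decision about what to run next, which is the MIMD property being discussed.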

Wes Felter - wes...@felter.org

Nick Maclaren

Aug 5, 2008, 12:40:59 PM

In article <48987d7b$1@kcnews01>, Wes Felter <wes...@felter.org> writes:
|>
|> The real advantage has been lost in the PR: Larrabee doesn't just
|> support C, it supports pthreads (and thus any other concurrency model
|> that can be built on pthreads).

Unfortunately, the very concept of supporting C and pthreads is
ill-formed. The standards are so grossly inconsistent that God
alone knows what they mean. I know for a certainty that nobody
who worked on them does.

The reason that pthreads causes only as much problem as it does
is that users don't use pthreads as such for high-communication
applications, and so the incidence of failing race conditions and
exposed inconsistencies is low. That applies EVEN to codes written
solely for the x86!

If users start using Larrabee or Niagara etc. for high-communication
applications, and use pthreads, all that will change.

|> I noticed recently that Cilk++, TBB, Fortress, and X10 are all using
|> work-stealing rather than static partitioning. AFAIK MIMD is a
|> prerequisite for work-stealing, so many of the future parallel
|> programming languages may not be able to run on conventional GPUs
|> at all.

I notice your implication that those have a future - well, we can
agree that they don't have a past :-)

More seriously, I agree with you, whether it is those languages or
others. SIMD has been proven to be a massively successful model,
for a restricted set of problems. And attempts to extend it to a
very much wider range of problems have failed, over a period of 30+
years. I teach that you should always look at SIMD first, and use
it if at all possible, but don't be surprised if it isn't.

Regards,
Nick Maclaren.

Terje Mathisen

Aug 5, 2008, 1:20:20 PM

Nick Maclaren wrote:
> In article <48987d7b$1@kcnews01>, Wes Felter <wes...@felter.org> writes:
> |>
> |> The real advantage has been lost in the PR: Larrabee doesn't just
> |> support C, it supports pthreads (and thus any other concurrency model
> |> that can be built on pthreads).
>
> Unfortunately, the very concept of supporting C and pthreads is
> ill-formed. The standards are so grossly inconsistent that God
> alone knows what they mean. I know for a certainty that nobody
> who worked on them does.

According to the nice white paper Intel published, they've already
extended pthreads:

http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf

"We have extended the API to also allow developers to specify thread
affinity with a particular HW thread or core."

and then they go on to say:

"Although P-threads is a powerful thread programming API, its
thread creation and thread switching costs may be too high for
some application threading. To amortize such costs, Larrabee
Native provides a task scheduling API based on a light weight
distributed task stealing scheduler [Blumofe et al. 1996]. A
production implementation of such a task programming API can
be found in Intel Thread Building Blocks"
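
The first of the two extensions quoted above, thread affinity, has a rough analogue on ordinary Linux systems. The sketch below uses Python's `os.sched_setaffinity`, which is a Linux-only OS facility and not the Larrabee API (the paper does not give that API's signature); the wrapper name is hypothetical.

```python
import os


def pin_current_thread(cpu):
    """Try to restrict the caller to a single CPU.

    Returns the resulting affinity set, or None when the platform does
    not offer os.sched_setaffinity or refuses the request. This is a
    stand-in for "specify thread affinity with a particular HW thread
    or core", not Intel's extension.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(0, {cpu})   # 0 means "the caller"
        return os.sched_getaffinity(0)
    except OSError:
        return None
```

A caller would normally save `os.sched_getaffinity(0)` first and restore it afterwards, since the restriction stays in force until it is changed again.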

The key missing item, at least to me, was a specification of the double
vs single precision performance. On the original Cell, double ran at 1/8
the speed of float, but it seems like more recent versions are fixing
this, to the point where you get about 50% of the throughput.

This is an important point for people (like me) who would like to have a
TFlop or so available in single chip and then gang up a cluster of them
to run serious simulation tasks.

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Nick Maclaren

Aug 5, 2008, 1:52:18 PM

In article <XZWdnTqNEJhJFgXV...@giganews.com>,

Clearly useful, but it doesn't address my points. If they had
defined a proper memory model, or sorted out the thread-safety
mess, that would be much more useful.

|> and then they go on to say:
|>
|> "Although P-threads is a powerful thread programming API, its
|> thread creation and thread switching costs may be too high for
|> some application threading. To amortize such costs, Larrabee
|> Native provides a task scheduling API based on a light weight
|> distributed task stealing scheduler [Blumofe et al. 1996]. A
|> production implementation of such a task programming API can
|> be found in Intel Thread Building Blocks"

Well, the actual specification may say something more rational;
as it stands, that is codswallop. Because there is so much state
in C and a pthread, you can't quiesce one section of code and start
another without doing it at the thread level.

|> The key missing item, at least to me, was a specification of the double
|> vs single precision performance. On the original Cell, double ran at 1/8
|> the speed of float, but it seems like more recent versions are fixing
|> this, to the point where you get about 50% of the throughput.

A key point compared with the chip being unprogrammable?

Yes, it's important, but let's see if it is possible to program the
thing and get reliable results even with integers! And that is so
far unproven. Remember the Itanic?

Regards,
Nick Maclaren.

Chris M. Thomasson

Aug 5, 2008, 3:42:01 PM

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:XZWdnTqNEJhJFgXV...@giganews.com...

> Nick Maclaren wrote:
>> In article <48987d7b$1@kcnews01>, Wes Felter <wes...@felter.org> writes:
>> |> |> The real advantage has been lost in the PR: Larrabee doesn't just
>> |> support C, it supports pthreads (and thus any other concurrency model
>> |> that can be built on pthreads).
>>
>> Unfortunately, the very concept of supporting C and pthreads is
>> ill-formed. The standards are so grossly inconsistent that God
>> alone knows what they mean. I know for a certainty that nobody
>> who worked on them does.
>
> According to the nice white paper Intel published, they've already
> extended pthreads:
>
> http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
>
> "We have extended the API to also allow developers to specify thread
> affinity with a particular HW thread or core."
>
> and then they go on to say:
>
> "Although P-threads is a powerful thread programming API, its
> thread creation and thread switching costs may be too high for
> some application threading. To amortize such costs, Larrabee
> Native provides a task scheduling API based on a light weight
> distributed task stealing scheduler [Blumofe et al. 1996]. A
> production implementation of such a task programming API can
> be found in Intel Thread Building Blocks"

FWIW, last time I checked, there was a very nasty race-condition in the TBB
"scheduler":

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/75e96ade96038553
(read all...)

Also, there is a much better work-stealing algorithm out there:

http://research.sun.com/scalable/pubs/DynamicWorkstealing.pdf

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8ad297f61b369a41

However, knowing Sun, it probably has a patent application...

John Larkin

Aug 5, 2008, 3:38:15 PM

Oops, 4 threads per core is 320 threads.

My XP is currently running 33 processes and maybe a couple dozen
device drivers.

John

Chris M. Thomasson

Aug 5, 2008, 3:54:14 PM

"John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in message
news:rtrg9458spr43ss94...@4ax.com...

I can see it now... A mega-core GPU chip that can dedicate 1 core per-pixel.

lol.

> This has got to affect OS design.

They need to completely rethink their multi-threaded synchronization
algorithms. I have a feeling that efficient distributed non-blocking
algorithms, which are comfortable running under a very weak cache coherency
model will be all the rage. Getting rid of atomic RMW or StoreLoad style
memory barriers is the first step.

Dirk Bruere at NeoPax

Aug 5, 2008, 3:57:22 PM

Chris M. Thomasson wrote:
> "John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in
> message news:rtrg9458spr43ss94...@4ax.com...
>> On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying"
>> <Blood...@hotmail.com> wrote:
>>
>>> As the number of cores goes up the watt requirements goes up too ?
>>
>> Not necessarily, if the technology progresses and the clock rates are
>> kept reasonable. And one can always throttle down the CPUs that aren't
>> busy.
>>
>>>
>>> Will we need a zillion watts of power soon ?
>>>
>>> Bye,
>>> Skybuck.
>>>
>>
>> I saw suggestions of something like 60 cores, 240 threads in the
>> reasonable future.
>
> I can see it now... A mega-core GPU chip that can dedicate 1 core
> per-pixel.

Why not?
Probably configured as a systolic array
http://en.wikipedia.org/wiki/Systolic_array

Rarius

Aug 5, 2008, 7:13:46 PM

"Dirk Bruere at NeoPax" <dirk....@gmail.com> wrote in message
news:6fqv72F...@mid.individual.net...

> Skybuck Flying wrote:
>> As the number of cores goes up the watt requirements goes up too ?
>>
>> Will we need a zillion watts of power soon ?
>>
>> Bye,
>> Skybuck.
>
> Since the ATI Radeon™ HD 4800 series has 800 cores you work it out.

Just note that the 4870 needs TWO of those 6 pin power leads...

Rarius


Terje Mathisen

Aug 6, 2008, 10:26:57 AM

I'm very confident that the chip will actually work, and give useful,
repeatable results, but I don't expect things like fast (or even any?)
denormal handling except flush to zero.
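
For anyone unsure what flush to zero would mean in practice, here is a small model of it. It uses Python floats, which are IEEE 754 doubles on common platforms, while the discussion is mostly about single precision; the helper names are illustrative and nothing here reflects how Larrabee actually behaves.

```python
import math
import sys

# Smallest positive normal double; anything nonzero below it in
# magnitude is a denormal (subnormal) value.
SMALLEST_NORMAL = sys.float_info.min


def is_denormal(x):
    """True for nonzero values too small to be stored as normal numbers."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


def flush_to_zero(x):
    """Model flush-to-zero: replace any denormal with a signed zero."""
    if is_denormal(x):
        return math.copysign(0.0, x)
    return x


tiny = SMALLEST_NORMAL / 4       # representable only as a denormal
print(tiny > 0.0, flush_to_zero(tiny), flush_to_zero(1.5))
```

With gradual underflow, `tiny` is still a nonzero number; with flush to zero it is simply lost, which matters for codes that rely on `x - y == 0` implying `x == y`.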

Nick Maclaren

Aug 6, 2008, 10:38:52 AM

In article <DoudnWJZvs0rKQTV...@giganews.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|>
|> > Yes, it's important, but let's see if it is possible to program the
|> > thing and get reliable results even with integers! And that is so
|> > far unproven. Remember the Itanic?
|>
|> I'm very confident that the chip will actually work, and give useful,
|> repeatable results, but I don't expect things like fast (or even any?)
|> denormal handling except flush to zero.

That's not what I mean. Yes, I agree that the chip should work
according to specification. But, if it were so foul to program
that only a few dozen people, worldwide, could write code for it
that was reliable, efficient AND useful, then what?

The record of almost all seriously parallel features so far is
that they are straightforward to use for simple, vectorisable codes
like the BLAS and similar operations, or embarrassingly parallel
applications, and utterly evil for almost anything else.

There isn't a problem for embarrassingly parallel codes - or is
there? Well, yes. The killer is that almost all computationally
intensive codes are memory intensive, too, and the memory conflict
kills you. There ARE exceptions, yes - cryptography, some work in
number theory, QCD and (to some extent) CODECs being the main ones.

Regards,
Nick Maclaren.

NV55

Aug 6, 2008, 10:28:53 PM

On Aug 5, 5:26 am, Dirk Bruere at NeoPax <dirk.bru...@gmail.com>
wrote:

> Skybuck Flying wrote:
> > As the number of cores goes up the watt requirements goes up too ?
>
> > Will we need a zillion watts of power soon ?
>
> > Bye,
> > Skybuck.
>
> Since the ATI Radeon™ HD 4800 series has 800 cores you work it out.
>
> --
> Dirk

The 800 "cores" in ATI's RV770 (Radeon 4800 series) are simple stream
processors and are not comparable to the 16, 24, 32 or 48 cores that
will be in Larrabee, just as they are not comparable to the 240 "cores"
in the Nvidia GeForce GTX 280. I'm not saying you didn't realize that;
this is just for those who might not have.

John Larkin

Aug 6, 2008, 10:57:23 PM

Run one process per CPU. Run the OS kernel, and nothing else, on one
CPU. Never context switch. Never swap. Never crash.

John

Nick Maclaren

Aug 7, 2008, 3:47:13 AM

In article <b0pk941drmfvmlr4o...@4ax.com>,

John Larkin <jjla...@highNOTlandTHIStechnologyPART.com> writes:
|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
|> <n...@spam.invalid> wrote:
|> >"John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in message
|> >news:rtrg9458spr43ss94...@4ax.com...
|> >

|> >> This has got to affect OS design.
|> >
|> >They need to completely rethink their multi-threaded synchronization
|> >algorithms. I have a feeling that efficient distributed non-blocking
|> >algorithms, which are comfortable running under a very weak cache coherency
|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad style
|> >memory barriers is the first step.
|>
|> Run one process per CPU. Run the OS kernel, and nothing else, on one
|> CPU. Never context switch. Never swap. Never crash.

Been there - done that :-)

That is precisely how the early SMP systems worked, and it works
for dinky little SMP systems of 4-8 cores. But the kernel becomes
the bottleneck for many workloads even on those, and it doesn't
scale to large numbers of cores. So you HAVE to multi-thread the
kernel.

SGI were (are?) the leaders, but all of HP, IBM and Sun have been
along the same path. Modern Linux is multi-threaded.
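
The bottleneck argument can be put as a back-of-envelope model. This is a deliberately crude sketch: it assumes one CPU is reserved for the kernel, that every worker spends the same fixed fraction of its time on kernel services, and that the kernel CPU serves those requests one at a time. None of these numbers come from a real system, and the function name is made up.

```python
def worker_throughput(n_cores, kernel_fraction):
    """Useful worker throughput, in units of one fully busy core.

    kernel_fraction is the share of each worker's time that needs the
    single kernel CPU. Once the workers together ask for more than one
    CPU's worth of kernel time, adding workers no longer helps.
    """
    workers = n_cores - 1                    # one CPU runs only the kernel
    if kernel_fraction <= 0:
        return float(workers)
    return min(float(workers), 1.0 / kernel_fraction)


# With 5% kernel time per worker the ceiling is about 20 workers.
for n in (4, 8, 32, 64):
    print(n, worker_throughput(n, 0.05))
```

In this model 4 or 8 cores are all put to use, while 32 and 64 cores give the same throughput as each other, which is the shape of the problem described above.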

Regards,
Nick Maclaren.

Terje Mathisen

Aug 7, 2008, 4:00:19 AM

Nick Maclaren wrote:
> In article <DoudnWJZvs0rKQTV...@giganews.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> writes:
> |>
> |> > Yes, it's important, but let's see if it is possible to program the
> |> > thing and get reliable results even with integers! And that is so
> |> > far unproven. Remember the Itanic?
> |>
> |> I'm very confident that the chip will actually work, and give useful,
> |> repeatable results, but I don't expect things like fast (or even any?)
> |> denormal handling except flush to zero.
>
> That's not what I mean. Yes, I agree that the chip should work
> according to specification. But, if it were so foul to program
> that only a few dozen people, worldwide, could write code for it
> that was reliable, efficient AND useful, then what?

I do believe asm programmers/thinkers will stay employed for the
foreseeable future, yes. :-)

More seriously, the gather/scatter hw seems like a very good match for
more advanced codes that use sparse matrix techniques, with indirect
addressing etc.:

The G/S unit should be able to take a group of 16 aligned pointers to
data items, then lookup all very quickly and return the set of actual
data to be worked on. If the data blocks have been allocated in
sequential memory, then the "load all items from a given cache line in a
single cycle" would make such access patterns quite efficient.
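
To make the access-pattern point concrete, here is a sketch of what a 16-wide gather does and how many cache lines it touches. The 4-byte element size, the 64-byte line size and a line-aligned array base are assumptions chosen for illustration, and the function names are not taken from Intel's paper.

```python
def gather(memory, indices):
    """Load memory[i] for each index, as one vector gather would."""
    return [memory[i] for i in indices]


def cache_lines_touched(indices, element_bytes=4, line_bytes=64):
    """Count distinct cache lines covered by the gathered elements.

    Assumes the array starts on a line boundary and that element_bytes
    divides line_bytes, so no element straddles two lines.
    """
    return len({(i * element_bytes) // line_bytes for i in indices})


data = list(range(1000, 2000))
sequential = list(range(16))          # 16 neighbouring items
strided = list(range(0, 256, 16))     # 16 items, one per cache line

print(gather(data, sequential)[:4])
print(cache_lines_touched(sequential), cache_lines_touched(strided))
```

Sixteen neighbouring items fall in a single line, while the strided set touches sixteen different lines, so the same gather instruction can cost very different amounts of memory traffic.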

>
> The record of almost all seriously parallel features so far is
> that they are straightforward to use for simple, vectorisable codes
> like the BLAS and similar operations, or embarrassingly parallel
> applications, and utterly evil for almost anything else.

You're (unfortunately) almost certainly right. :-(

I believe most problems _can_ be mapped onto LRB style architectures,
but not without significant work by good programmers, i.e. nothing at
all like the "just recompile with our magic compiler" that seems to be
the holy grail.

Re. total memory bandwidth:

I agree that LRB will be no good at all for codes that work best without
caches, i.e. where blocking is impossible. The big question is if this
is an absolute requirement of the underlying problem, or if there is
some other way to solve it, even at the cost of doing (much) more work?

This is of course the area where nearly everyone has been working for
the last 2-3 decades, as the memory wall has rushed closer and closer.
I.e. this problem must be solved no matter which architecture you work with!

Nick Maclaren

Aug 7, 2008, 4:21:26 AM

In article <BdqdnZKJo_MJNgfV...@giganews.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|>
|> > That's not what I mean. Yes, I agree that the chip should work
|> > according to specification. But, if it were so foul to program
|> > that only a few dozen people, worldwide, could write code for it
|> > that was reliable, efficient AND useful, then what?
|>
|> I do believe asm programmers/thinkers will stay employed for the
|> foreseeable future, yes. :-)

You know that's not what I meant :-)

More seriously, the history of the past 30 years has been to reduce
the requirements for such people by dropping standards. Will that
deliver something that can be claimed to work, even by salesdroids?
The jury hasn't even retired yet!

Several parallel systems of the past failed because their users
couldn't handle them, not because they didn't work. Are we about to
see a change? I just don't know.

|> More seriously, the gather/scatter hw seems like a very good match for
|> more advanced codes that use sparse matrix techniques, with indirect
|> addressing etc.:

[ Other relevant points snipped ]

Actually, I disagree. I think that it's a gimmick. Few people are
interested in sparsity within a cache line (or even page). What is
needed is something too radical for Intel, which is to separate off
the MMU aspects of the ISA and allow much better designed control
of cache preloading. And that DOESN'T mean adding yet another hack,
but a step back and serious reconsideration.

Sun's scout thread approach is along the right lines, though I doubt
that it is a very good one.

For example, consider an architecture where there was a sparse 'touch'
instruction, with some kind of prioritisation. Combine that with an
LRU algorithm that used different rates for touched pages that had not
yet been accessed and ones that had. I can see how to generate code
for that which would have the potential of reducing latency considerably.

Regards,
Nick Maclaren.

nedbrek

Aug 7, 2008, 6:44:58 AM

Hello all,

"Wilco Dijkstra" <Wilco.remove...@ntlworld.com> wrote in message
news:jzVlk.63795$dz3....@newsfe20.ams2...

> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> news:kXPlk.7164$QX3....@newsfe02.iad...

>> Are they saying that programming this chip will be easier than
>> programming a GPU because it honors the well established x86 arch?
>
> That's rubbish indeed. The cache coherency seems to be the only advantage
> as other GPU also support C. However the claimed x86 "compatibility"
> isn't. If you use C the ISA doesn't matter much, and if you write
> assembler then there is no compatibility as the new SIMD instructions
> don't exist on any current x86's and Larrabee doesn't appear to support
> SSE instructions either...
>
> It would have made far more sense to use a simpler and more streamlined
> ISA which would give a significant codesize, area and power saving, if not
> a performance boost. But Intel is always keen to push their inefficient
> ISA where it doesn't belong...

"Never ungerestimate the power of x86!" - Yogurt

- Forward compatibility: DX9(IIRC?) used the assembly language for
NVIDIA's GPU. I remember at Intel someone having to write a dynamic
translator. Of course, NVIDIA was in the same boat as everyone else for
their next generation (which wanted a whole new instruction set). x86 has
been a stable platform for over twenty years.

- It's not an advantage for the customer, it's an advantage for the
designers! Think verification - all your unit tests continue to work.
Tools infrastructure (simulators, etc.)

The overhead is somewhat painful for tiny cores. But there are a lot of
tricks you can play... And when you are a process generation ahead of
everyone else...

Ned

Not speaking for Intel

John Larkin

Aug 7, 2008, 10:08:52 AM

On 7 Aug 2008 07:47:13 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

>
>In article <b0pk941drmfvmlr4o...@4ax.com>,
>John Larkin <jjla...@highNOTlandTHIStechnologyPART.com> writes:
>|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
>|> <n...@spam.invalid> wrote:
>|> >"John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in message
>|> >news:rtrg9458spr43ss94...@4ax.com...
>|> >
>|> >> This has got to affect OS design.
>|> >
>|> >They need to completely rethink their multi-threaded synchronization
>|> >algorithms. I have a feeling that efficient distributed non-blocking
>|> >algorithms, which are comfortable running under a very weak cache coherency
>|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad style
>|> >memory barriers is the first step.
>|>
>|> Run one process per CPU. Run the OS kernel, and nothing else, on one
>|> CPU. Never context switch. Never swap. Never crash.
>
>Been there - done that :-)
>
>That is precisely how the early SMP systems worked, and it works
>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>the bottleneck for many workloads even on those, and it doesn't
>scale to large numbers of cores. So you HAVE to multi-thread the
>kernel.

Why? All it has to do is grant run permissions and look at the big
picture. It certainly wouldn't do I/O or networking or file
management. If memory allocation becomes a burden, it can set up four
(or fourteen) memory-allocation cores and let them do the crunching.
Why multi-thread *anything* when hundreds or thousands of CPUs are
available?

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

John

Nick Maclaren

Aug 7, 2008, 10:25:01 AM

In article <d10m94d7etb6sfcem...@4ax.com>,

I don't have time to describe 40 years of experience to you, and
it is better written up in books, anyway. Microkernels of the sort
you mention were trendy a decade or two back (look up Mach), but
introduced too many bottlenecks.

In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

The reason that exporting them to multiple CPUs doesn't solve the
scalability problems is that the interaction rate goes up more
than linearly with the number of CPUs. And the same problem
applies to memory management, if you are going to allow shared
memory - or even virtual shared memory, as in PGAS languages.

And so it goes. TANSTAAFL.

|> Using multicore properly will require undoing about 60 years of
|> thinking, 60 years of believing that CPUs are expensive.

Now, THAT is true.

Regards,
Nick Maclaren.

Chris M. Thomasson

Aug 7, 2008, 10:42:36 AM

"John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in message

news:d10m94d7etb6sfcem...@4ax.com...

FWIW, I have a memory allocation algorithm which can scale because it's based
on per-thread/core/node heaps:

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

AFAICT, there is absolutely no need for memory-allocation cores. Each thread
can have a private heap such that local allocations do not need any
synchronization. Also, thread local deallocations of memory do not need any
sync. Local meaning that Thread A allocates memory M which is subsequently
freed by Thread A. When a thread's memory pool is exhausted, it then tries to
allocate from the core local heap. If that fails, then it asks the system,
and perhaps virtual memory comes into play.

A scalable high-level memory allocation algorithm for a super-computer
could look something like:
_____________________________________________________________
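#include <stddef.h>

/* Tiered allocators, assumed to be defined elsewhere. Each returns
   NULL when its own pool cannot satisfy the request. */
void* Per_Thread_Try_Allocate(size_t sz);
void* Per_Core_Try_Allocate(size_t sz);
void* Per_Chip_Try_Allocate(size_t sz);
void* Per_Node_Try_Allocate(size_t sz);
void* System_Try_Allocate(size_t sz);
void Report_Allocation_Failure(size_t sz);

void* malloc(size_t sz) {
  void* mem;

  /* level 1 - thread local */
  if (!(mem = Per_Thread_Try_Allocate(sz))) {

    /* level 2 - core local */
    if (!(mem = Per_Core_Try_Allocate(sz))) {

      /* level 3 - physical chip local */
      if (!(mem = Per_Chip_Try_Allocate(sz))) {

        /* level 4 - node local */
        if (!(mem = Per_Node_Try_Allocate(sz))) {

          /* level 5 - system-wide */
          if (!(mem = System_Try_Allocate(sz))) {

            /* level 6 - failure */
            Report_Allocation_Failure(sz);
            return NULL;
          }
        }
      }
    }
  }

  return mem;
}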
_____________________________________________________________

Level 1 does not need any atomic RMW OR membars at all.

Level 2 does not need membars, but needs atomic RMW.

Level 3 would need membars and atomic RMW.

Level 4 is same as level 3

Level 5 is worst case scenario, may need MPI...

Level 6 is total memory exhaustion! Ouch...

All local frees have same overhead while all remote frees need atomic RMW
and possibly membars.
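
To make the difference between level 1 and level 2 concrete, here is a rough sketch using C11 atomics. It is not the allocator from the linked thread: `thread_heap`, `core_heap` and both functions are invented for illustration, the heaps are plain bump allocators with no free path, and alignment is ignored.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Level 1: owned by exactly one thread, so plain loads and stores suffice. */
struct thread_heap {
    unsigned char *base;
    size_t used;
    size_t size;
};

/* Level 2: shared by the threads on one core, so the bump is an atomic RMW. */
struct core_heap {
    unsigned char *base;
    atomic_size_t used;
    size_t size;
};

/* No atomics and no barriers: nobody else can see this heap. */
void *thread_heap_alloc(struct thread_heap *h, size_t sz)
{
    if (sz > h->size - h->used)
        return NULL;
    void *p = h->base + h->used;
    h->used += sz;
    return p;
}

/* Compare-and-swap loop with relaxed ordering: an atomic RMW, but no
   memory barrier. Only the offset is contended; the block handed out
   is used solely by the thread that won the CAS. */
void *core_heap_alloc(struct core_heap *h, size_t sz)
{
    size_t old = atomic_load_explicit(&h->used, memory_order_relaxed);
    do {
        if (sz > h->size - old)
            return NULL;    /* pool exhausted: caller falls to next level */
    } while (!atomic_compare_exchange_weak_explicit(
                 &h->used, &old, old + sz,
                 memory_order_relaxed, memory_order_relaxed));
    return h->base + old;
}
```

A remote free, where one thread returns a block to another thread's heap, would need stronger ordering than this, which is where the membars mentioned above come in.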

This algorithm can scale to very large numbers of cores, chips and nodes.

> Using multicore properly will require undoing about 60 years of
> thinking, 60 years of believing that CPUs are expensive.

The bottleneck is the cache-coherency system. Luckily, there are years of
experience in dealing with weak cache schemes... Think RCU.

> Why multi-thread *anything* when hundreds or thousands of CPUs are
> available?

You don't think there is any need for communication between cores on a chip?

Chris M. Thomasson

Aug 7, 2008, 10:44:19 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:PNDmk.8961$Bt6....@newsfe04.iad...

> "John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in
> message news:d10m94d7etb6sfcem...@4ax.com...

[...]

>> Using multicore properly will require undoing about 60 years of
>> thinking, 60 years of believing that CPUs are expensive.
>
> The bottleneck is the cache-coherency system.

I meant to say:

/One/ bottleneck is the cache-coherency system.

Nick Maclaren

Aug 7, 2008, 11:17:46 AM

In article <PNDmk.8961$Bt6....@newsfe04.iad>,

"Chris M. Thomasson" <n...@spam.invalid> writes:
|>
|> FWIW, I have a memory allocation algorithm which can scale because it's based
|> on per-thread/core/node heaps:
|>

|> AFAICT, there is absolutely no need for memory-allocation cores. Each thread
|> can have a private heap such that local allocations do not need any
|> synchronization.

Provided that you can live with the constraints of that approach.
Most applications can, but not all.

Regards,
Nick Maclaren.

Dirk Bruere at NeoPax

Aug 7, 2008, 4:57:07 PM

True, but they seem to be positioning Larrabee in the same tech segment
as video cards. Which makes sense since a SIMD system is the easiest to
program. If they want N general purpose cores doing general purpose
computing the whole thing will bog down somewhere between 16 and 32. A
lot of the R&D theory was done 30+ years ago.

Maybe they will try something radical, like an ancient data flow
architecture, but I doubt it.

Robert Myers

Aug 7, 2008, 5:23:16 PM

On Aug 7, 4:57 pm, Dirk Bruere at NeoPax <dirk.bru...@gmail.com>
wrote:

>

> > Each of the 800 "cores", which are simple stream processors, in
> > ATI RV770
> > (Radeon 4800 series) are not comparable to the 16, 24, 32 or 48
> > cores that will be in Larrabee. Just like they're not comparable to
> > the 240 "cores" in Nvidia GeForce GTX 280. Though I'm not saying
> > you didn't realize that, just for those that might not have.
>
> True, but they seem to be positioning Larrabee in the same tech segment
> as video cards. Which makes sense since a SIMD system is the easiest to
> program. If they want N general purpose cores doing general purpose
> computing the whole thing will bog down somewhere between 16 and 32. A
> lot of the R&D theory was done 30+ years ago.
>
> Maybe they will try something radical, like an ancient data flow
> architecture, but I doubt it.
>

"General purpose" GPU's are not really general purpose, but they
aren't doing graphics, either.

Robert.

Wilco Dijkstra

Aug 7, 2008, 7:20:19 PM

"nedbrek" <ned...@yahoo.com> wrote in message news:KmAmk.371$_H1.310@trnddc05...

> Hello all,
>
> "Wilco Dijkstra" <Wilco.remove...@ntlworld.com> wrote in message news:jzVlk.63795$dz3....@newsfe20.ams2...
>> "Chris M. Thomasson" <n...@spam.invalid> wrote in message news:kXPlk.7164$QX3....@newsfe02.iad...
>>> Are they saying that programming this chip will be easier than programming a GPU because it honors the well
>>> established x86 arch?
>>
>> That's rubbish indeed. The cache coherency seems to be the only advantage
>> as other GPU also support C. However the claimed x86 "compatibility" isn't. If
>> you use C the ISA doesn't matter much, and if you write assembler then there is
>> no compatibility as the new SIMD instructions don't exist on any current x86's
>> and Larrabee doesn't appear to support SSE instructions either...
>>
>> It would have made far more sense to use a simpler and more streamlined
>> ISA which would give a significant codesize, area and power saving, if not
>> a performance boost. But Intel is always keen to push their inefficient ISA
>> where it doesn't belong...
>
> "Never ungerestimate the power of x86!" - Yogurt

Maybe he meant power as in power consumption? :-)

> - Forward compatibility: DX9(IIRC?) used the assembly language for NVIDIA's GPU. I remember at Intel someone having
> to write a dynamic translator. Of course, NVIDIA was in the same boat as everyone else for their next generation
> (which wanted a whole new instruction set). x86 has been a stable platform for over twenty years.

Well there is no existing x86 GPU code AFAIK, so any ISA would do, as
long as it would be easily extensible.

> - It's not an advantage for the customer, it's an advantage for the designers! Think verification - all your unit
> tests continue to work. Tools infrastructure (simulators, etc.)

That's true, but I'd argue that a simpler core would be easier to design
and verify, so reusing existing infrastructure is far less important. The key
thing is that any savings in terms of size, power, code density etc are
multiplied, so trying to save on design time is the wrong tradeoff.

> The overhead is somewhat painful for tiny cores. But there are a lot of tricks you can play... And when you are a
> process generation ahead of everyone else...

Being ahead on process is not enough - compare Atom with an ARM
1 or 2 process generations behind. All those stories claiming Atom
was going to be used in mobiles were just hilarious as most ARMs at
full speed use less power than Atom doing nothing in its deepest sleep...

Wilco

nedbrek

Aug 8, 2008, 6:37:19 AM

Hello all,

"Wilco Dijkstra" <Wilco.remove...@ntlworld.com> wrote in message

news:TqLmk.186928$x66....@newsfe25.ams2...

>
> "nedbrek" <ned...@yahoo.com> wrote in message
> news:KmAmk.371$_H1.310@trnddc05...
>> Hello all,
>

> Well there is no existing x86 GPU code AFAIK, so any ISA would do, as
> long as it would be easily extensible.

It's not so much "easily extensible" as "extensible done right". A lot of
people have tried to produce a long line of compatible processors. Only IBM
and Intel have done it successfully over decades (I would count ARM's start
as ~1995, when DEC's StrongArm made people take it seriously).

>> - It's not an advantage for the customer, it's an advantage for the
>> designers! Think verification - all your unit tests continue to work.
>> Tools infrastructure (simulators, etc.)
>
> That's true, but I'd argue that a simpler core would be easier to design
> and verify, so reusing existing infrastructure is far less important. The
> key
> thing is that any savings in terms of size, power, code density etc are
> multiplied, so trying to save on design time is the wrong tradeoff.

That's the question. How long will these tiny cores remain tiny? LRB is
already pretty complicated (multithreaded and superscalar). Nick is trying
to warn people against 1000's of cores, I'd add my voice to that. People
are going to want more complicated cores (better single thread performance).

>> The overhead is somewhat painful for tiny cores. But there are a lot
>> of tricks you can play... And when you are a process generation ahead of
>> everyone else...
>
> Being ahead on process is not enough - compare Atom with an ARM
> 1 or 2 process generations behind. All those stories claiming Atom
> was going to be used in mobiles were just hilarious as most ARMs at
> full speed use less power than Atom doing nothing in its deepest sleep...

I haven't followed too closely, but I recall Intel always targeting Atom at
ultra mobiles (and maybe set top boxes). For a first generation, Atom looks
like a tremendous success. I would imagine future remaps will start
ratcheting down the power (and phone power budget may come up...).

Ned

Bernd Paysan

Aug 8, 2008, 7:02:15 AM

Nick Maclaren wrote:
> In theory, the kernel doesn't have to do I/O or networking, but
> have you ever used a system where they were outside it? I have.

Actually, doing I/O or networking in a "main" CPU is a waste of resources. Any
sane architecture (CDC 6600, mainframes) has a bunch of multi-threaded IO
processors, which you program so that the main CPU has little effort to
deal with IO.

This works well even when you do virtualization. The main CPU sends a
pointer to an IO processor program ("high-level abstraction", not the
device driver details) to the IO processor, which in turn runs the device
driver to get the data in or out. In a VM, the VM monitor has to
sanity-check the command, maybe rewrites it ("don't write to track 3 of
disk 5, write it to the 16 sectors starting at sector 8819834 in disk 1,
which is where the virtual volume of this VM sits").
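
The check-and-rewrite step in a VM monitor might look roughly like the sketch below. Everything in it is hypothetical: `io_command`, `vm_volume` and `vm_rewrite_command` are invented names, and a real channel program would carry far more than a disk number and a sector range.

```c
#include <stdbool.h>
#include <stdint.h>

/* High-level request as the guest issues it (hypothetical layout). */
struct io_command {
    uint32_t disk;          /* disk number as the guest sees it */
    uint64_t first_sector;  /* start sector, relative to the guest's volume */
    uint32_t sector_count;
};

/* Where one guest's virtual volume really lives (hypothetical layout). */
struct vm_volume {
    uint32_t phys_disk;
    uint64_t base_sector;   /* first physical sector of the volume */
    uint64_t size_sectors;  /* length of the volume in sectors */
};

/* Sanity-check the guest's command against its volume, then rewrite it
   in place to physical coordinates. Returns false, leaving the command
   untouched, if the request would fall outside the volume. */
bool vm_rewrite_command(const struct vm_volume *vol, struct io_command *cmd)
{
    if (cmd->sector_count == 0 ||
        cmd->first_sector >= vol->size_sectors ||
        cmd->sector_count > vol->size_sectors - cmd->first_sector)
        return false;

    cmd->disk = vol->phys_disk;
    cmd->first_sector += vol->base_sector;
    return true;
}
```

The rewritten command is what the monitor would hand to the IO processor; the main CPU never touches the device registers.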

The fact that in PCs the main CPU is doing IO (even down to the level of
writing to individual IO ports) is a consequence of saving CPUs - no money
for an IO processor, the 8088 can do that itself just fine. Why we'll soon
have 32 x86 cores, but still no IO processor is beyond what I can
understand.

Basically all IO in a modern PC is sending fixed- or variable-sized packets
over some sort of network - via SATA/SCSI, via USB, Firewire, or Ethernet,
etc.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Jan Panteltje

Aug 8, 2008, 7:30:04 AM

On a sunny day (Fri, 08 Aug 2008 13:02:15 +0200) it happened Bernd Paysan
<bernd....@gmx.de> wrote in <nmltm5-...@annette.zetex.de>:

Do not forget, since the days of 8088, and maybe CPUs running at about 13 MHz,
we now run at 3.4 GHz, 3400 / 13 = 261 x faster.
Also even faster because of better architectures.
This leaves plenty of time for a CPU to do normal IO.
And in fact the IO has been hardware supported always.
For example, although you can poll a serial port bit by bit, there is a hardware shift register,
hardware FIFO too.
Although you can construct sectors for a floppy in software bit by bit, there is a floppy controller
with write pre-compensation etc.. all in hardware.
Although you could do graphics there is a graphics card with hardware acceleration.
the first 2 are included in the chip set, maybe the graphics too.
The same thing for Ethernet, it is a dedicated chip, or included in the chip set,
taking the place of your 'IO processor'.
Same thing for hard disks, and those may even have on board encryption, all you
have to do is specify a sector number and send the sector data.

So.. no real need for a separate IO processor, in fact you likely find a processor
in all that dedicated hardware, or maybe a FPGA.

Terje Mathisen

Aug 8, 2008, 10:13:36 AM

nedbrek wrote:
> That's the question. How long will these tiny cores remain tiny? LRB is
> already pretty complicated (multithreaded and superscalar). Nick is trying
> to warn people against 1000's of cores, I'd add my voice to that. People
> are going to want more complicated cores (better single thread performance).

When Intel can manufacture 32 or 64 LRB cores on a single chip, using
the same process as the 8 Core 2 cores that fit in the same area, I
certainly expect something we've been talking about for years, i.e.
heterogeneous multi-processing:

Maybe 2-4 Core-class fast cores, and 32-48 LRB cores, with an OS which
knows about the different needs of different applications.

This way throughput apps would run across the LRB cores, while
performance-critical single-thread tasks would stay on one of the fast
cores.

Having a single in-order core as the only active core when the system is
(mostly) idling would also save quite a bit of power, right?

John Larkin

Aug 8, 2008, 10:40:53 AM

That's the IBM "channel controller" concept: add complex, specialized
dma-based i/o controllers to take the load off the CPU. But if you
have hundreds of CPU's, the strategy changes.

John

Jan Panteltje

Aug 8, 2008, 10:57:21 AM

On a sunny day (Fri, 08 Aug 2008 07:40:53 -0700) it happened John Larkin
<jjla...@highNOTlandTHIStechnologyPART.com> wrote in
<bkmo94dhicn8ipk01...@4ax.com>:

>That's the IBM "channel controller" concept: add complex, specialized
>dma-based i/o controllers to take the load off the CPU. But if you
>have hundreds of CPU's, the strategy changes.
>
>John

Ultimately you will have to move bytes, from one CPU to the other,
or from dedicated IO to one CPU, and things have to happen at the right moment.
Results will never be available before requests......
It is a bit like Usenet: (smile), there are many 'processors' (readers, posters,
lurkers) here, some output some data at some time in response to some event,
could be a question, others read it, later, much later perhaps, see the problem?
Watched the Olympic opening, I must say the Chinese make a beautiful event.
Never got boring, the previous one was ugly and not worth looking at, but
anyways, so many LEDs? And some projection!
Seems they are ahead in many a field.
Would you not be scared to death if you were a little girl hanging 25 meters
above the floor from some steel cables.....
Chinese are brave too :-)

John Larkin

Aug 8, 2008, 11:54:36 AM

On Thu, 7 Aug 2008 07:44:19 -0700, "Chris M. Thomasson"
<n...@spam.invalid> wrote:

>
>"Chris M. Thomasson" <n...@spam.invalid> wrote in message
>news:PNDmk.8961$Bt6....@newsfe04.iad...
>> "John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in
>> message news:d10m94d7etb6sfcem...@4ax.com...
>[...]
>>> Using multicore properly will require undoing about 60 years of
>>> thinking, 60 years of believing that CPUs are expensive.
>>
>> The bottleneck is the cache-coherency system.
>
>I meant to say:
>
>/One/ bottleneck is the cache-coherency system.
>
>

I think the trend is to have the cores surround a common shared cache;
a little local memory (and cache, if the local memory is slower for
some reason) per CPU wouldn't hurt.

Cache coherency is simple if you don't insist on flat-out maximum
performance. What we should insist on is flat-out unbreakable systems,
and buy better silicon to get the performance back if we need it.

I'm reading Showstopper!, the story of the development of NT. It's a
great example of why we need a different way of thinking about OS's.

Silicon is going to make that happen, finally free us of the tyranny
of CPU-as-precious-resource. A lot of programmers aren't going to like
this.

John

Robert Myers

Aug 8, 2008, 12:45:36 PM

On Aug 8, 10:13 am, Terje Mathisen <terje.mathi...@hda.hydro.com>
wrote:

Many cores on a chip really opens the design space. I'm expecting a
pre-Cambrian explosion of sorts. I can think of lots of
possibilities, and I'm sure there are many I can't imagine.

Robert.

Jan Panteltje

Aug 8, 2008, 12:55:14 PM

On a sunny day (Fri, 08 Aug 2008 08:54:36 -0700) it happened John Larkin
<jjla...@highNOTlandTHIStechnologyPART.com> wrote in
<8v4m945fbcvrln66t...@4ax.com>:

>>/One/ bottleneck is the cache-coherency system.
>>
>>
>
>I think the trend is to have the cores surround a common shared cache;
>a little local memory (and cache, if the local memory is slower for
>some reason) per CPU wouldn't hurt.
>
>Cache coherency is simple if you don't insist on flat-out maximum
>performance. What we should insist on is flat-out unbreakable systems,
>and buy better silicon to get the performance back if we need it.
>
>I'm reading Showstopper!, the story of the development of NT. It's a
>great example of why we need a different way of thinking about OS's.
>
>Silicon is going to make that happen, finally free us of the tyranny
>of CPU-as-precious-resource. A lot of programmers aren't going to like
>this.
>
>John

John Lennon:

'You know I am a dreamer'
....
' And I hope you join us someday'

(well what I remember of it).
You should REALLY try to program a Cell processor some day.

Dunno what you have against programmers, there are programmers who
are amazingly clever with hardware resources.
I dunno about NT and MS, but IIRC MS plucked programmers from
unis, and sort of brainwashed them then.. the result we all know.

Martin Brown

Aug 8, 2008, 1:03:09 PM

John Larkin wrote:
> On Thu, 7 Aug 2008 07:44:19 -0700, "Chris M. Thomasson"
> <n...@spam.invalid> wrote:
>
>> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
>> news:PNDmk.8961$Bt6....@newsfe04.iad...
>>> "John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in
>>> message news:d10m94d7etb6sfcem...@4ax.com...
>> [...]
>>>> Using multicore properly will require undoing about 60 years of
>>>> thinking, 60 years of believing that CPUs are expensive.

>>> The bottleneck is the cache-coherency system.
>> I meant to say:
>>
>> /One/ bottleneck is the cache-coherency system.
>
> I think the trend is to have the cores surround a common shared cache;
> a little local memory (and cache, if the local memory is slower for
> some reason) per CPU wouldn't hurt.

For small N this can be made to work very nicely.

>
> Cache coherency is simple if you don't insist on flat-out maximum
> performance. What we should insist on is flat-out unbreakable systems,
> and buy better silicon to get the performance back if we need it.

Existing cache hardware on Pentiums still isn't quite good enough. Try
probing its memory with large power of two strides and you fall over a
performance limitation caused by the cheap and cheerful way it uses
lower address bits for cache associativity. See Steven Johnson's post in
the FFT Timing thread.
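
The effect is easy to show with a toy model of set selection. The geometry here (64-byte lines, 64 sets) is an assumption chosen for illustration, not the parameters of any particular Pentium, and `cache_set_index` and `distinct_sets` are made-up helper names:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 64-byte lines, 64 sets selected by low address bits. */
enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Set index = line number modulo the number of sets. */
static unsigned cache_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Count how many different sets are hit by `count` accesses that start
   at address 0 and advance by `stride` bytes each time. */
static unsigned distinct_sets(size_t stride, size_t count)
{
    unsigned char seen[NUM_SETS] = {0};
    unsigned n = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned s = cache_set_index((uintptr_t)(i * stride));
        if (!seen[s]) {
            seen[s] = 1;
            n++;
        }
    }
    return n;
}
```

Under this model, 64 accesses at a 4096-byte stride all land in one set, so they compete for that set's few ways, while padding the stride to 4160 bytes (one extra line) spreads the same 64 accesses over all 64 sets.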

>
> I'm reading Showstopper!, the story of the development of NT. It's a
> great example of why we need a different way of thinking about OS's.

If it is anything like the development of OS/2 you get to see very
bright guys reinvent things from scratch that were already known in the
mini and mainframe world (sometimes with the same bugs and quirks as the
first iteration of big iron code suffered from).

NT 3.51 was a particularly good vintage. After that bloatware set in.

>
> Silicon is going to make that happen, finally free us of the tyranny
> of CPU-as-precious-resource. A lot of programmers aren't going to like
> this.

CPU cycles are cheap and getting cheaper and human cycles are expensive
and getting more expensive. But that also says that we should also be
using better tools and languages to manage the hardware.

Unfortunately time to market advantage tends to produce less than robust
applications with pretty interfaces and fragile internals. You can after
all send out code patches over the Internet all too easily ;-)

Since people buy the stuff (I would not wish Vista on my worst enemy by
the way) even with all its faults the market rules, and market forces
are never wrong...

Most of what you are claiming as advantages of separate CPUs can be
achieved just as easily with hardware support for protected user memory
and security privilege rings. It is more likely that virtualisation of
single, dual or quad cores will become common in domestic PCs.

There was a Pentium exploit documented against some brands of Unix. eg.
http://www.ssi.gouv.fr/fr/sciences/fichiers/lti/cansecwest2006-duflot.pdf

Loads of physical CPUs just creates a different set of complexity
problems. And they are a pig to program efficiently.

Regards,
Martin Brown
** Posted from http://www.teranews.com **

Chris M. Thomasson

Aug 8, 2008, 3:40:59 PM

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:04KdnQblxJkMyQHV...@giganews.com...

> nedbrek wrote:
>> That's the question. How long will these tiny cores remain tiny? LRB is
>> already pretty complicated (multithreaded and superscalar). Nick is
>> trying to warn people against 1000's of cores, I'd add my voice to that.
>> People are going to want more complicated cores (better single thread
>> performance).
>
> When Intel can manufacture 32 or 64 LRB cores on a single chip, using the
> same process as the 8 Core 2 cores that fit in the same area, I certainly
> expect something we've been talking about for years, i.e. heterogeneous
> multi-processing:
>
> Maybe 2-4 Core-class fast cores, and 32-48 LRB cores, with an OS which
> knows about the different needs of different applications.

Intel has an experimental 80-core chip:

http://news.cnet.com/Intel-shows-off-80-core-processor/2100-1006_3-6158181.html?hhTest=1

Not sure what they want to do with it though...

John Larkin

Aug 8, 2008, 4:16:07 PM

Yes. Everybody thought they could write from scratch a better
(whatever) than the other groups had already developed, and in a few
weeks yet. There were "two inch pipes full of piss flowing in both
directions" between graphics groups.

Code reuse is not popular among people who live to write code.

>
>NT 3.51 was a particularly good vintage. After that bloatware set in.
>>
>> Silicon is going to make that happen, finally free us of the tyranny
>> of CPU-as-precious-resource. A lot of programmers aren't going to like
>> this.
>
>CPU cycles are cheap and getting cheaper and human cycles are expensive
>and getting more expensive. But that also says that we should also be
>using better tools and languages to manage the hardware.
>
>Unfortunately time to market advantage tends to produce less than robust
>applications with pretty interfaces and fragile internals. You can after
>all send out code patches over the Internet all too easily ;-)

NT followed the classic methodology: code fast, build the OS,
test/test/test looking for bugs. I think there were 2000 known bugs in
the first developer's release. There must have been ballpark 100K bugs
created and fixed during development.

>
>Since people buy the stuff (I would not wish Vista on my worst enemy by
>the way) even with all its faults the market rules, and market forces
>are never wrong...
>
>Most of what you are claiming as advantages of separate CPUs can be
>achieved just as easily with hardware support for protected user memory
>and security privilege rings. It is more likely that virtualisation of
>single, dual or quad cores will become common in domestic PCs.

Intel was criminally negligent in not providing better hardware
protections, and Microsoft a co-criminal in not using what little was
available. Microsoft has never seen data that it didn't want to
execute. I ran PDP-11 timeshare systems that couldn't be crashed by
hostile users, and ran for months between power failures.

>
>There was a Pentium exploit documented against some brands of Unix. eg.
>http://www.ssi.gouv.fr/fr/sciences/fichiers/lti/cansecwest2006-duflot.pdf
>
>Loads of physical CPUs just creates a different set of complexity
>problems. And they are a pig to program efficiently.

So program them inefficiently. Stop thinking about CPU cycles as
precious resources, and start thinking that users matter more. I have
personally spent far more time recovering from Windows crashes and
stupidities than I've spent waiting for compute-bound stuff to run.

If the OS runs alone on one CPU, totally hardware protected from all
other processes, totally in control, that's not complex.

As transistors get smaller and cheaper, and cores multiply into the
hundreds, the limiting resource will become power dissipation. So if
every process gets its own CPU, and idle CPUs power down, and there's
no context switching overhead, the multi-CPU system is net better off.

What else are we gonna do with 1024 cores? We'll probably see it on
Linux first.

John

Dirk Bruere at NeoPax

Aug 8, 2008, 6:25:43 PM

I was doing/learning all this stuff 30 years ago.
We even developed a loosely coupled multi-uP system where each module had
a comms processor, an apps processor and an OS processor. Back then all
these problems had already been analysed to death, and solutions found
(if they existed). The future of Intel/MS R&D ought to be reading IEEE
papers from the 60s/70s.

Chris M. Thomasson

Aug 8, 2008, 7:57:57 PM

"John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in message

news:379p94ljurcolgmk5...@4ax.com...

One point:

RCU can scale to thousands of cores; Linux uses that algorithm in its kernel
today.
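
Why RCU readers are so cheap can be seen from the publish step alone. The sketch below uses C11 atomics and is not the Linux kernel API (the kernel's primitives are `rcu_read_lock()`, `rcu_dereference()`, `rcu_assign_pointer()` and `synchronize_rcu()`). `config`, `read_value` and `update_value` are invented names, a single writer is assumed, and grace-period detection is left out entirely: the old version is simply handed back to the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct config {
    int value;
    unsigned version;
};

/* The currently published version; NULL until the first update. */
static _Atomic(struct config *) current_config;

/* Reader: one acquire load, no lock and no atomic RMW.
   Returns -1 if nothing has been published yet. */
int read_value(void)
{
    struct config *c =
        atomic_load_explicit(&current_config, memory_order_acquire);
    return c ? c->value : -1;
}

/* Writer (single writer assumed): copy the current version, modify the
   copy, then publish it with a release store. The old version is
   returned through `retired`; it must not be freed until no reader can
   still be using it, which is the part real RCU automates. */
bool update_value(int v, struct config **retired)
{
    struct config *old =
        atomic_load_explicit(&current_config, memory_order_relaxed);
    struct config *fresh = malloc(sizeof *fresh);
    if (!fresh)
        return false;

    if (old)
        *fresh = *old;          /* read-copy ... */
    else
        fresh->version = 0;
    fresh->value = v;           /* ... update the private copy ... */
    fresh->version++;

    /* ... and publish: readers see either the old or the new version. */
    atomic_store_explicit(&current_config, fresh, memory_order_release);
    *retired = old;
    return true;
}
```

Readers never write to shared state, so adding more reading cores adds no coherency traffic on the pointer; all the cost sits with the writer and with reclamation.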

JosephKK

Aug 9, 2008, 11:53:48 AM

On Tue, 05 Aug 2008 08:24:04 -0700, John Larkin
<jjla...@highNOTlandTHIStechnologyPART.com> wrote:

>On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying"
><Blood...@hotmail.com> wrote:
>
>>As the number of cores goes up the watt requirements goes up too ?
>

>Not necessarily, if the technology progresses and the clock rates are
>kept reasonable. And one can always throttle down the CPUs that aren't
>busy.

>
>>
>>Will we need a zillion watts of power soon ?
>>
>>Bye,
>> Skybuck.
>>
>

>I saw suggestions of something like 60 cores, 240 threads in the
>reasonable future.

>
>This has got to affect OS design.
>

>John

This won't bother *nix class OS's. They have been scaled past 10
thousand cores already. Other OS are on their own.

JosephKK

Aug 9, 2008, 11:58:39 AM

On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
<n...@spam.invalid> wrote:

>"John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in message

>news:rtrg9458spr43ss94...@4ax.com...

>> On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying"
>> <Blood...@hotmail.com> wrote:
>>
>>>As the number of cores goes up the watt requirements goes up too ?
>>
>> Not necessarily, if the technology progresses and the clock rates are
>> kept reasonable. And one can always throttle down the CPUs that aren't
>> busy.
>>
>>>
>>>Will we need a zillion watts of power soon ?
>>>
>>>Bye,
>>> Skybuck.
>>>
>>
>> I saw suggestions of something like 60 cores, 240 threads in the
>> reasonable future.
>

>I can see it now... A mega-core GPU chip that can dedicate 1 core per-pixel.
>
>lol.
>

At that point you should integrate them directly into the display.
Then you could get to giga core systems.

>
>
>
>> This has got to affect OS design.
>

>They need to completely rethink their multi-threaded synchronization
>algorithms. I have a feeling that efficient distributed non-blocking
>algorithms, which are comfortable running under a very weak cache coherency
>model will be all the rage. Getting rid of atomic RMW or StoreLoad style
>memory barriers is the first step.

That reminds me of an article / paper i once read about Cache Only
Memory Architecture (COMA). Only they did seem to be able to get it
to work though.

JosephKK

Aug 9, 2008, 12:02:53 PM

On Wed, 06 Aug 2008 19:57:23 -0700, John Larkin
<jjla...@highNOTlandTHIStechnologyPART.com> wrote:

>On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
><n...@spam.invalid> wrote:
>
>>"John Larkin" <jjla...@highNOTlandTHIStechnologyPART.com> wrote in message
>>news:rtrg9458spr43ss94...@4ax.com...
>>> On Tue, 5 Aug 2008 13:30:52 +0200, "Skybuck Flying"
>>> <Blood...@hotmail.com> wrote:
>>>
>>>>As the number of cores goes up the watt requirements goes up too ?
>>>
>>> Not necessarily, if the technology progresses and the clock rates are
>>> kept reasonable. And one can always throttle down the CPUs that aren't
>>> busy.
>>>
>>>>
>>>>Will we need a zillion watts of power soon ?
>>>>
>>>>Bye,
>>>> Skybuck.
>>>>
>>>
>>> I saw suggestions of something like 60 cores, 240 threads in the
>>> reasonable future.
>>
>>I can see it now... A mega-core GPU chip that can dedicate 1 core per-pixel.
>>
>>lol.
>>
>>
>>
>>

>>> This has got to affect OS design.
>>
>>They need to completely rethink their multi-threaded synchronization
>>algorithms. I have a feeling that efficient distributed non-blocking
>>algorithms, which are comfortable running under a very weak cache coherency
>>model will be all the rage. Getting rid of atomic RMW or StoreLoad style
>>memory barriers is the first step.
>

>Run one process per CPU. Run the OS kernel, and nothing else, on one
>CPU. Never context switch. Never swap. Never crash.
>

>John

OK. How do you deal with I/O devices, user input and hot swap?

John Larkin

Aug 9, 2008, 12:09:29 PM

On Sat, 09 Aug 2008 09:02:53 -0700, JosephKK <quiett...@yahoo.com>
wrote:

I/O and user interface, just like now: device drivers and GUI's. Just
run them on separate CPUs, and have hardware control over anything
that could crash the system, specifically global memory mapping. There
have been OS's that, for example, pre-qualified the rights of DMA
controllers so even a rogue driver couldn't punch holes in memory at
random.

But hot swap? What do you mean? All the CPUs are on one chip.

John

JosephKK

Aug 9, 2008, 12:15:28 PM

Why would it? The design could also use hundreds or thousands of
dedicated I/O controllers. If you want to talk about real
bottlenecks look at memory and data bus limitations.

John Larkin

Aug 9, 2008, 12:28:17 PM

On Sat, 09 Aug 2008 09:15:28 -0700, JosephKK <quiett...@yahoo.com>
wrote:

A lot of hardware sorts of stuff, like tcp/ip stack accelerators,
could be done in a dedicated cpu. Sort of like using a PIC to blink an
LED. Part of the channel-controller thing was driven by not wanting to
burden an expensive CPU with scut work and interrupts and context
switching overhead. All that stops mattering when cpu's are free. Of
course, disk controllers and graphics processors would still be
needed, but simpler ones and fewer of them.

Multicore is especially interesting for embedded systems, where there
are likely a modest number of processes and no dynamic add/drop of
tasks. The most critical ones, like an important servo loop, could be
dedicated and brutally simple. Freescale is already going multicore on
embedded chips, and I think others are, too. The RTOS boys are *not*

Robert Myers

Aug 9, 2008, 12:30:06 PM

On Aug 9, 12:15 pm, JosephKK <quiettechb...@yahoo.com> wrote:

>
> Why would it? The design could also use hundreds or thousands of
> dedicated I/O controllers. If you want to talk about real
> bottlenecks look at memory and data bus limitations.

mmhmm.

Bandwidth per flop is headed toward zero.

Robert.

John Larkin

Aug 9, 2008, 12:36:59 PM

On Sat, 09 Aug 2008 09:15:28 -0700, JosephKK <quiett...@yahoo.com>

wrote:

What bottlenecks? Most PC's have speed to burn. What they don't have
is security, reliability, or simplicity. But more cpu's, each with a
little local ram, surrounding a shared cache, have got to be more
efficient than a single CPU thrashing between 60 or so processes.

Or maybe things will never change, just like they never changed in
past years.

John

JosephKK

Aug 9, 2008, 12:52:03 PM

On Fri, 08 Aug 2008 18:03:09 +0100, Martin Brown
<|||newspam|||@nezumi.demon.co.uk> wrote:

Yeah, to people with broadband. Back when XP SP2 came out I was still
on dial-up; MS sent me a CD for free. Consider costs like that before
spouting.

>
>Since people buy the stuff (I would not wish Vista on my worst enemy by
>the way) even with all its faults the market rules, and market forces
>are never wrong...
>
>Most of what you are claiming as advantages of separate CPUs can be
>achieved just as easily with hardware support for protected user memory
>and security privilege rings. It is more likely that virtualisation of
>single, dual or quad cores will become common in domestic PCs.

Why virtualize them? I can have them physically. Of course M$ PC
style software still cannot use them efficiently. Nor can it use
64-bit effectively, and it usually makes poor use of SSE, SSE2, etc.

>
>There was a Pentium exploit documented against some brands of Unix. eg.
>http://www.ssi.gouv.fr/fr/sciences/fichiers/lti/cansecwest2006-duflot.pdf
>
>Loads of physical CPUs just creates a different set of complexity
>problems. And they are a pig to program efficiently.

Mostly due to groupthink in the style of MS-DOS and its follow-ons. We have a
generation of programmers that never learned partitioning properly.

JosephKK

Aug 9, 2008, 1:03:36 PM

I have run compute-bound stuff on a PC that took hours (about 5 1/2)
to run, and I wrote it myself. It was clean and efficient, just compute
bound. I tried it on a recent machine; it took about 10 minutes. Yet
the typical application on the typical PC seems to have shown no
performance improvement over the past 10 years.
What do you think is the cause?

>
>If the OS runs alone on one CPU, totally hardware protected from all
>other processes, totally in control, that's not complex.
>
>As transistors get smaller and cheaper, and cores multiply into the
>hundreds, the limiting resource will become power dissipation. So if
>every process gets its own CPU, and idle CPUs power down, and there's
>no context switching overhead, the multi-CPU system is net better off.
>
>What else are we gonna do with 1024 cores? We'll probably see it on
>Linux first.

We have already seen it on Linux, in the form of parallel
supercomputers. With more cores as well.

>
>John
>

John Larkin

Aug 9, 2008, 1:20:40 PM

On Sat, 09 Aug 2008 10:03:36 -0700, JosephKK <quiett...@yahoo.com>
wrote:

A given program will run far faster on modern iron. But modern apps
have mostly factored increased cpu speed and memory into their
designs, and bloated up to match.

>
>>
>>If the OS runs alone on one CPU, totally hardware protected from all
>>other processes, totally in control, that's not complex.
>>
>>As transistors get smaller and cheaper, and cores multiply into the
>>hundreds, the limiting resource will become power dissipation. So if
>>every process gets its own CPU, and idle CPUs power down, and there's
>>no context switching overhead, the multi-CPU system is net better off.
>>
>>What else are we gonna do with 1024 cores? We'll probably see it on
>>Linux first.
>
>We have already seen it on Linux, in the form of parallel
>supercomputers. With more cores as well.

http://tunes.org/~unios/oskernels.html

"First, some words about the meaning of "kernel". Operating Systems
can be written so that most services are moved outside the OS core and
implemented as processes. This OS core then becomes a lot smaller, and
we call it a kernel. When this kernel only provides the basic
services, such as basic memory management and multithreading, it is
called a microkernel or even nanokernel for the super-small ones. To
stress the difference between the Unix-type of OS, the Unix-like core
is called a monolithic kernel. A monolithic kernel provides full
process management, device drivers, file systems, network access etc.
I will here use the word kernel in the broad sense, meaning the part
of the OS supervising the machine."

Most popular os's (Win, Linux, Unix) are big-kernel designs, to reduce
inter-process overhead. That makes them complex, buggy, and
paradoxically slow.

John

Jan Panteltje

Aug 9, 2008, 1:48:21 PM

On a sunny day (Sat, 09 Aug 2008 10:20:40 -0700) it happened John Larkin
<jjla...@highNOTlandTHIStechnologyPART.com> wrote in
<12kr94p9sm7accdle...@4ax.com>:

>"First, some words about the meaning of "kernel". Operating Systems
>can be written so that most services are moved outside the OS core and
>implemented as processes.This OS core then becomes a lot smaller, and
>we call it a kernel. When this kernel only provides the basic
>services, such as basic memory management ant multithreading, it is
>called a microkernel or even nanokernel for the super-small ones. To
>stress the difference between the
>
>Unix-type of OS, the Unix-like core is called a monolithic kernel. A
>monolithic kernel provides full process management, device
>drivers,file systems, network access etc. I will here use the word
>kernel in the broad sense, meaning the part of the OS supervising the
>machine."

Just to rain a bit on your parade, in the *Linux* kernel,
many years ago, the concept of 'modules' was introduced.
Now device drivers are 'modules', and are, although closely connected, and in the same
source package, _not_ a real part of the kernel.
(I am no Linux kernel expert, but it is absolutely possible to write a device
driver as a module, and then, while the system is running, load that module,
and unload it again.)
I sort of have the feeling that your knowledge of Linux, and the Linux kernel, is very academic John,
and you should really compile a kernel and play with Linux a bit to get
the feel of it.

>Most popular os's (Win, Linux, Unix) are big-kernel designs, to reduce
>inter-process overhead. That makes them complex, buggy, and
>paradoxically slow.

Unix has been around for decades and has been refined more and more; Linux and BSD are incarnations of it.

There was some old saying that went like this (correct me, hopefully somebody knows it more precisely):
"Those who criticise Unix are bound to re-invent it".

Bill Todd

Aug 10, 2008, 5:58:13 AM

Er, the discussion that John quoted above referred not to what is
compiled with the kernel but to what executes in the same protection
domain that the kernel does (as it is my impression Linux modules do).
Perhaps John is not the one who needs to develop a deeper understanding
here.

- bill

Jan Panteltje

Aug 10, 2008, 6:38:01 AM

On a sunny day (Sun, 10 Aug 2008 05:58:13 -0400) it happened Bill Todd
<bill...@metrocast.net> wrote in
<1aqdnfjG5tCEJgPV...@metrocastcablevision.com>:

>> Just to rain a bit on your parade, in the *Linux* kernel,
>> many years ago, the concept of 'modules' was introduced.
>> Now device drivers are 'modules', and are, although closely connected, and in the same
>> source package, _not_ a real part of the kernel.
>> (I am no Linux kernel expert, but it is absolutely possible to write a device
>> driver as a module, and then, while the system is running, load that module,
>> and unload it again.)
>> I sort of have the feeling that your knowledge of Linux, and the Linux kernel, is very academic John,
>> and you should really compile a kernel and play with Linux a bit to get
>> the feel of it.
>
>Er, the discussion that John quoted above referred not to what is
>compiled with the kernel but to what executes in the same protection
>domain that the kernel does (as it is my impression Linux modules do).
>Perhaps John is not the one who needs to develop a deeper understanding
>here.

He mentioned 'monolithic', and with modules, the Linux kernel is _not_ monolithic.
You can load a device driver as a module (after you configured it to be a module
before compilation; the kernel config often gives you a choice), and
then that module will even be dynamically loaded, including other modules it depends on,
and unloaded too if that device is no longer used.
This keeps memory usage low, and means you do not need to reboot when you add a new driver.
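
As a rough illustration of the load/unload behaviour described above, here is a toy Python model of reference-counted module loading. It is only a sketch: the ModuleRegistry class and the module names (disk_driver, bus_core and so on) are made up for the example, and it does not reflect how the kernel or its module tools actually implement this.

```python
class ModuleRegistry:
    """Toy model: loading a module pulls in its dependencies; unused ones drop out."""

    def __init__(self, depends):
        self.depends = depends   # hypothetical map: module name -> modules it needs
        self.refcount = {}       # loaded module -> number of current users

    def load(self, name):
        # Load dependencies first, then count one more user of this module.
        for dep in self.depends.get(name, []):
            self.load(dep)
        self.refcount[name] = self.refcount.get(name, 0) + 1

    def unload(self, name):
        # Drop one user; a module with no users left is removed from memory.
        self.refcount[name] -= 1
        if self.refcount[name] == 0:
            del self.refcount[name]
        for dep in self.depends.get(name, []):
            self.unload(dep)

    def loaded(self):
        return sorted(self.refcount)


registry = ModuleRegistry({
    "disk_driver": ["bus_core", "block_layer"],
    "mouse_driver": ["bus_core"],
})
registry.load("disk_driver")
registry.load("mouse_driver")
registry.unload("disk_driver")
print(registry.loaded())
```

In this model bus_core stays resident after the disk driver goes away because the mouse driver still uses it, while block_layer is dropped.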

As to 'protection domain' be aware that even if you were to run device drivers on a different core (one for each device???)
then you will still have to move the data from one core to the other for processing, and
how protected do you think that data is? It is all illusion: 'More cores will solve everything.'.
I wonder how many here actually use Linux, compiled a kernel, wrote modules and applications,
and even can write in C.
I'd rather have a discussion with them than the generalised bloating about systems they never even
had hands-on experience with.
In that case sci.electronics.design becomes like sci.physics, bunch of idiots with even
more idiotic theories causing so much noise that the real stuff is obscured, and your chance to learn something
is zero.
This is my personal rant, I am a Linux user, written many applications for it, did some work on
drivers too.
Academic bullshit I know about too, in my first year Information Technology I found an error in the
text book, reported it, professors do not always like to be corrected, I learned that.
There was a project that you could join, about in depth study of operating systems, and, since I actually
wrote one, I applied for the project, was promptly rejected.
Where did those guys go? Microsoft??????
I will listen to John Larkin's theory about how safe multicore systems are after he writes a demo, or even
shows someone else's that cannot be corrupted.
Utopia does not exist.

Chris M. Thomasson

Aug 10, 2008, 7:04:17 AM

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:g7f3mq$shf$1...@gemini.csx.cam.ac.uk...
>
> In article <PNDmk.8961$Bt6....@newsfe04.iad>,
> "Chris M. Thomasson" <n...@spam.invalid> writes:
> |>
> |> FWIW, I have a memory allocation algorithm which can scale because its
> based
> |> on per-thread/core/node heaps:
> |>
> |> AFAICT, there is absolutely no need for memory-allocation cores. Each
> thread
> |> can have a private heap such that local allocations do not need any
> |> synchronization.
>
> Provided that you can live with the constraints of that approach.
> Most applications can, but not all.

That's a great point! It just seems that the approach could possibly be
beneficial to all sorts of applications. Could you help me out here and give
some examples of a couple of applications that simply could not tolerate the
approach at any level? When I say any level I mean allocations starting at
the lowest common denominator from its origin... That is, trying the thread-local
heap first, then the core-local heap, and so on and so forth...

I see problems. Well, with mega-core systems, the per-core memory is going
to be limited indeed! It's analogous to programming a Cell with its dedicated
per-SPE memory; something like 256 KB. When the local allocation to a SPE is
exhausted, well, DMA to the global memory is going to need to be utilized. I
know this works because I have played around with algorithms using the IBM
Cell Simulator.
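
To make the "try the local heap first, then fall back" idea concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not the allocator referred to above: the TieredAllocator class, the block counts and the two-tier layout (thread-local, then one shared pool) are all invented for illustration. The only point it shows is that the local tier needs no lock, while the shared tier does.

```python
import threading


class TieredAllocator:
    """Toy model: a private per-thread pool first, then a locked shared pool."""

    def __init__(self, local_blocks, shared_blocks):
        self._local_blocks = local_blocks     # blocks given to each thread's private heap
        self._shared_free = shared_blocks     # blocks left in the shared (global) heap
        self._shared_lock = threading.Lock()  # only the shared tier is synchronized
        self._tls = threading.local()         # per-thread state, invisible to other threads

    def _local_free(self):
        # Lazily give each thread its own private block budget.
        if not hasattr(self._tls, "free"):
            self._tls.free = self._local_blocks
        return self._tls.free

    def alloc(self):
        # Fast path: private heap, no synchronization needed.
        if self._local_free() > 0:
            self._tls.free -= 1
            return "local"
        # Slow path: shared heap, guarded by a lock.
        with self._shared_lock:
            if self._shared_free > 0:
                self._shared_free -= 1
                return "shared"
        return None  # both tiers exhausted


allocator = TieredAllocator(local_blocks=2, shared_blocks=1)
print([allocator.alloc() for _ in range(4)])
```

A thread that starts later gets its own private budget, so it can still allocate locally even after the shared pool has run dry.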

http://groups.google.com/group/comp.arch/browse_frm/thread/4c97441d6704d8a1

http://groups.google.com/group/comp.arch/msg/4133f6eb8a6b5a74

programming the Cell is VERY FUN!!!!

Phil Hobbs

Aug 10, 2008, 7:33:55 AM

In order to maintain cache coherence, interconnect bandwidth wants to go as
the square or the cube of Moore's Law, depending on your assumptions (Rent's
rule might make it the 1.5th or 2.5th power, but not less than that). In
many-processor SMPs, that bandwidth dominates. Hence there's a move afoot
towards specialization, as in the Cell, which is a SIMD machine like an old Cray.

The cache coherence problem is a thorny one, because if full coherence is
relaxed very much,
(a) programming gets much much harder, and
(b) the range of problems that the machine can tackle efficiently drops like
a rock.

Thus I'm not sure what local storage allocation really gets you, because ISTM
it's a smallish piece of a much bigger and thornier problem.

There are intermediate design points, such as an MxN-way system, with M N-way
SMPs. If N stops scaling, that makes the cache coherence problem easier and
saves interconnect power.
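
A back-of-envelope Python sketch of why the MxN arrangement helps. This is a deliberately crude model and an assumption of this example, not the scaling argument made above: it simply counts pairwise coherence paths, treating every processor in a flat SMP as talking to every other one, and treating each cluster as a single node at the top level.

```python
def pairwise_paths(n):
    """Number of distinct pairs among n endpoints."""
    return n * (n - 1) // 2


def flat_paths(total):
    """One big SMP: every processor is kept coherent with every other."""
    return pairwise_paths(total)


def clustered_paths(m, n):
    """M clusters of N-way SMPs: full coherence inside each cluster,
    plus one path between each pair of clusters."""
    return m * pairwise_paths(n) + pairwise_paths(m)


print(flat_paths(64))          # 64 processors in one flat SMP
print(clustered_paths(8, 8))   # the same 64 processors as 8 clusters of 8
```

Under these assumptions the flat 64-way system has 2016 pairs to keep coherent, against 252 for the 8x8 arrangement.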

As the old saying goes, computer design is 'bottleneckology'.

Cheers,

Phil Hobbs

Piotr Wyderski

Aug 10, 2008, 8:14:30 AM

Terje Mathisen wrote:

> When Intel can manufacture 32 or 64 LRB cores on a single chip, using
> the same process as the 8 Core 2 cores that fit in the same area, I
> certainly expect something we've been talking about for years, i.e.
> heterogeneous multi-processing:
>
> Maybe 2-4 Core-class fast cores, and 32-48 LRB cores, with an OS which
> knows about the different needs of different applications.

IMHO rather something between homo- and heterogeneous.
A fully heterogenous system can be composed of any imaginable
CPU architectures, but is tedious to program, at least without
a virtualization layer. So in my opinion it will be a composition
of equally capable CPUs (in terms of features), so it will be
able to painlessly migrate threads between cores, but the cores
may be implemented differently. So logically they will be
homogenous, but physically heterogenous. Which seems to
be a fair trade-off.

Best regards
Piotr Wyderski


Jan Panteltje

Aug 10, 2008, 11:02:40 AM

On a sunny day (Sun, 10 Aug 2008 06:53:56 -0700) it happened AnimalMagic
<Anima...@petersbackyard.org> wrote in
<lhst941nf300ad7lb...@4ax.com>:

>On Sun, 10 Aug 2008 10:38:01 GMT, Jan Panteltje
><pNaonSt...@yahoo.com> wrote:
>
>>Utopia does not exist.
>
>
> Thanks to dopey, closed mindsets like yours.

Hey nutcase, YOU failed to hack it !

> After 3 years, folks are still trying to circumnavigate Sony's
>hypervisor control over the graphics port on the PS3, so those of us that
>run Linux on it cannot get accelerated graphics or GL performance on it.
>
> Apparently for them, Utopia's castle walls are still standing.

Learn to wipe your own arse.

Jan Panteltje

Aug 10, 2008, 11:10:09 AM

On a sunny day (Sun, 10 Aug 2008 15:02:40 GMT) it happened Jan Panteltje
<pNaonSt...@yahoo.com> wrote in <g7mvuk$2mc$1...@aioe.org>:

And for the others: Sony was to have two HDMI ports on the PS3,
should have made for interesting experiments.

But the real PS3 only had one, so I decided to skip on the Sony product
(most Sony products I have bought in the past were really bad actually).
And Linux you can run on anything (and runs on anything), for less than
the cost of a PS3 you can assemble a good PC, so if you must run Linux
why bother torturing yourself on a PS3? Use a real computer.

But perhaps if you are one of those gamers... well
the video modes also suck on that thing. And the power consumption is high,
not green at all, and it does not have that nice Nintendo remote.
:-)))))))))))))))))))))))))))))))))))))))))

ChrisQ

Aug 10, 2008, 12:36:53 PM

Jan Panteltje wrote:

> Do not forget, since the days of 8088, and maybe CPUs running at
> about 13 MHz, we now run at 3.4 GHz, 3400 / 13 = 261 x faster. Also
> even faster because of better architectures. This leaves plenty of
> time for a CPU to do normal IO. And in fact the IO has been hardware
> supported always. For example, although you can poll a serial port
> bit by bit, there is a hardware shift register, hardware FIFO too.
> Although you can construct sectors for a floppy in software bit by
> bit, there is a floppy controller with write pre-compensation etc..
> all in hardware. Although you could do graphics there is a graphics
> card with hardware acceleration. the first 2 are included in the chip
> set, maybe the graphics too. The same thing for Ethernet, it is a
> dedicated chip, or included in the chip set, taking the place of your
> 'IO processor'. Same thing for hard disks, and those may even have
> on board encryption, all you have to do is specify a sector number
> and send the sector data.
>
> So.. no real need for a separate IO processor, in fact you likely
> find a processor in all that dedicated hardware, or maybe a FPGA.
>
>

The op was right. If you task the main cpu with dealing
with all the io, the cpu then ends up being interrupted for minor stuff
like a keypress, disk block io completion, network tx and rx completion etc.
Dma processes also typically steal cpu cycles during a data transfer.
If the cpu is interrupted, I would assume all the instruction cache and
stream gets flushed every time, which might not be helpful either. It
may not have much impact for the average application, but for cpu + i/o
intensive apps, it might make a lot of difference.

Hardware is cheap now - a much better way would be to virtualise all the
io into high level commands: an i/o processor with its own memory
and bus would provide high level i/o and file system services as an
abstraction to the os. It would probably use a shared memory region (at
hardware level) to transfer the data.
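
A minimal Python sketch of that split, with a worker thread standing in for the i/o processor and two queues standing in for the shared memory region. Everything here is hypothetical (the io_processor and run_host functions, the "read"/"write" command tuples, the dict used as storage); the aim is only to show the host posting high level commands and picking up completions, instead of handling each low level event itself.

```python
import queue
import threading


def io_processor(commands, completions, storage):
    """Stand-in for a dedicated i/o processor: serves high level commands."""
    while True:
        cmd = commands.get()
        if cmd is None:          # shutdown marker posted by the host
            break
        op, name, payload = cmd
        if op == "write":
            storage[name] = payload
            completions.put((op, name, len(payload)))
        elif op == "read":
            completions.put((op, name, storage.get(name)))


def run_host(requests):
    """Host side: post commands, then collect completions in order."""
    commands, completions, storage = queue.Queue(), queue.Queue(), {}
    worker = threading.Thread(target=io_processor,
                              args=(commands, completions, storage))
    worker.start()
    for req in requests:
        commands.put(req)
    commands.put(None)
    worker.join()                # all commands have been served at this point
    results = []
    while not completions.empty():
        results.append(completions.get())
    return results


print(run_host([("write", "a", b"hi"), ("read", "a", None)]))
```

The host never sees individual device events here, only one completion record per command.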

Still, a 25 year old pc architecture still rules the world. We may have
faster hardware, but in absolute terms, it's still an abomination......

Chris

ChrisQ

Aug 10, 2008, 1:05:31 PM

Jan Panteltje wrote:

> John Lennon:
>
> 'You know I am a dreamer' .... ' And I hope you join us someday'
>
> (well what I remember of it). You should REALLY try to program a Cell
> processor some day.
>
> Dunno what you have against programmers, there are programmers who
> are amazingly clever with hardware resources. I dunno about NT and
> MS, but IIRC MS plucked programmers from unis, and sort of
> brainwashed them then.. the result we all know.
>
>

That's just the problem - programmers have been so good at hiding the
limitations of poorly designed hardware that the whole world thinks
that hardware must be perfect and needs no attention other than making
it go faster.

If you look at some modern i/o device architectures, it's obvious the
hardware engineers never gave a second thought about how the thing would
be programmed efficiently...

Chris (with embedded programmer hat on :-(

ChrisQ

Aug 10, 2008, 1:08:14 PM

UltimatePatriot wrote:

>
> The Cell BE IS the current future.
>
> VERY powerful. Ten times that of a PC in MANY areas. It will
> improve too.

I/O channel architecture on a chip? They probably had it right decades
ago with mainframes. It just took the rest of the world a while to catch
up...

Chris

Jan Panteltje

Aug 10, 2008, 1:24:59 PM

On a sunny day (Sun, 10 Aug 2008 17:05:31 +0000) it happened ChrisQ
<blac...@devnull.com> wrote in <g7n75m$vi$1...@aioe.org>:

Interesting.
For me, I have a hardware background, but also software, the two
came together with FPGA, when I wanted to implement DES as fast as possible.
I did wind up with just a bunch of gates and 1 clock cycle, so no program :-)
No loops (all unfolded in hardware).
So, you need to define some boundary between hardware resources (that one used a lot of gates),
and software resources, I think.

Tim Williams

Aug 10, 2008, 1:29:52 PM

"ChrisQ" <blac...@devnull.com> wrote in message
news:g7n75m$vi$1...@aioe.org...

> That's just the problem - programmers have been so good at hiding the
> limitations of poorly designed hardware

Is that like the crummy WinModems?

Tim

--
Deep Friar: a very philosophical monk.
Website: http://webpages.charter.net/dawill/tmoranwms

John Larkin

Aug 10, 2008, 1:32:17 PM

On Sun, 10 Aug 2008 10:38:01 GMT, Jan Panteltje
<pNaonSt...@yahoo.com> wrote:

What does C have to do with it, other than being a contributor to the
chaos that modern computing is? More big programming projects fail
than ever make it to market. OS's are commonly shipped with hundreds
or sometimes thousands of bugs. Serious damage has been done to
consumers, business, and US national security through the criminally
stupid design of Windows. Lots of people are refusing to upgrade their
apps because the newer releases are bigger, slower, and more fragile
than the older ones. In products with hardware, HDL-based logic, and
firmware, it's nearly always the firmware that's full of bugs. If
engineers can write bug-free VHDL, which they usually do, why can't
programmers write bug-free C, which they practically never do?

Things are broken, and we need a change. Since hardware works, and
software doesn't, we need more of the former with more control over
less of the latter. Fortunately, that *will* happen, and multicore is
one of the drivers.

>I'd rather have a discussion with them than the generalised bloating about systems they never even
>had hands-on experience with.
>In that case sci.electronics.design becomes like sci.physics, bunch of idiots with even
>more idiotic theories causing so much noise that the real stuff is obscured, and your chance to learn something
>is zero.
>This is my personal rant, I am a Linux user, written many applications for it, did some work on
>drivers too.
>Academic bullshit I know about too, in my first year Information Technology I found an error in the
>text book, reported it, professors do not always like to be corrected, I learned that.
>There was a project that you could join, about in depth study of operating systems, and, since I actually
>wrote one, I applied for the project, was promptly rejected.
>Where did those guys go? Microsoft??????
>I will listen to John Larkin's theory about how safe multicore systems are after he writes a demo, or even
>shows someone else's that cannot be corrupted.
>Utopia does not exist.

I have stated no theories. I have observed that the number of cores
per CPU chip is increasing radically, that Moore's law has
repartitioned itself away from raw CPU complexity and speed into
multiple, relatively modest processors. That this is happening across
the range of processors, scientific and desktop and embedded. Are you
denying that this is happening?

If not, do you have any opinions on whether having hundreds of fairly
fast CPUs, instead of one blindingly-fast one, will change OS design?
Will it change embedded app design?

If you have no opinions, and can conjecture no change, why do you get
mad at people who do, and can? Why do you post in a group that has
"design" in its name? Maybe you should start and moderate
sci.electronics.tradition.

John


Jan Panteltje

Aug 10, 2008, 1:48:42 PM

On a sunny day (Sun, 10 Aug 2008 10:32:17 -0700) it happened John Larkin
<jjla...@highNOTlandTHIStechnologyPART.com> wrote in
<k38u949ebukbdr3hi...@4ax.com>:

HI JOHN
electronics design is not (!= in C ;-) ) software design.
Just stating there will be more cores on a chip is obvious,
we knew that for years.

Stating that more cores will improve _reliability_
(in the widest sense of the word) as you seem to
(at least that is what I understand from your postings),
puts the burden of proof on you.

You call software bad, yet you claim your own small asm programs are perfect,
this makes one suspicious.

There is a lot of good software, I would say that software that does what
it is intended to do, and does that without crashing, is good software.
If that software runs on good hardware you can do a lot with it.
All the problems with MS operating system are alien to me, the last MS OS I bought was win98SE, I
still have it on a PC, and it does occasionally misbehave, use it
for my Canon scanner, and DVD layout sometimes.
I will not go online with it.....
All other things run various versions / distributions of Linux, think
I have tried most of these, all but RatHead worked OK.

So I do not really see your problem, things do not crash,
the soft I wrote myself does not crash,
things do not get infected with trojans, viruses, worms, other things...
I have a very good firewall (iptables), latest DNS fixes, this server has now
been running since 2004, still with the same Seagate harddisk...

What is your problem?
As to computer languages, the portability of C will help you out big time
once you want to run that same stable application on say a MIPS platform,
or any other processor.
Re-writing your code in ASM for each new platform is asking for bugs,
so C is a universal solution.
Especially for more complex programs.
AND operating systems.

Jan Panteltje

Aug 10, 2008, 1:50:46 PM

On a sunny day (Sun, 10 Aug 2008 10:37:59 -0700) it happened AnimalMagic
<Anima...@petersbackyard.org> wrote in
<np9u94l9en47rkd3k...@4ax.com>:

>On Sun, 10 Aug 2008 15:02:40 GMT, Jan Panteltje

><pNaonSt...@yahoo.com> wrote:
>
>>On a sunny day (Sun, 10 Aug 2008 06:53:56 -0700) it happened AnimalMagic
>><Anima...@petersbackyard.org> wrote in
>><lhst941nf300ad7lb...@4ax.com>:
>>
>>>On Sun, 10 Aug 2008 10:38:01 GMT, Jan Panteltje
>>><pNaonSt...@yahoo.com> wrote:
>>>
>>>>Utopia does not exist.
>>>
>>>
>>> Thanks to dopey, closed mindsets like yours.
>>
>>Hey nutcase,
>
>

> You're an idiot.

>
>> YOU failed to hack it !
>

> I never attempted to hack it, dipshit.

So you wait for others to do it for you?
Then you can 'do the little trick' like a monkey.
You must be posting from alt.monkeys.
Bye

Jan Panteltje

Aug 10, 2008, 1:52:14 PM

On a sunny day (Sun, 10 Aug 2008 10:46:26 -0700) it happened AnimalMagic
<Anima...@petersbackyard.org> wrote in
<1t9u9411k8930thle...@4ax.com>:

>On Sun, 10 Aug 2008 15:10:09 GMT, Jan Panteltje
><pNaonSt...@yahoo.com> wrote:
>
>>On a sunny day (Sun, 10 Aug 2008 15:02:40 GMT) it happened Jan Panteltje
>><pNaonSt...@yahoo.com> wrote in <g7mvuk$2mc$1...@aioe.org>:
>>
>>And for the others: Sony was to have two HDMI ports on the PS3,
>>should have made for interesting experiments.
>>
>>But the real PS3 only had one, so I decided to skip on the Sony product
>>(most Sony products I have bought in the past were really bad actually).
>>And Linux you can run on anything (and runs on anything),
>

> Where do you hook up the keyboard and monitor to your wireless LAN
>Router, running Linux at?

RS232 terminal dummy.
http://panteltje.com/panteltje/wap54g/index.html#wapserver

Terje Mathisen

Aug 10, 2008, 2:04:15 PM

Piotr Wyderski wrote:

> Terje Mathisen wrote:
>> Maybe 2-4 Core-class fast cores, and 32-48 LRB cores, with an OS which
>> knows about the different needs of different applications.
>
> IMHO rather something between homo- and heterogeneous.
> A fully heterogenous system can be composed of any imaginable
> CPU architectures, but is tedious to program, at least without
> a virtualization layer. So in my opinion it will be a composition
> of equally capable CPUs (in terms of features), so it will be
> able to painlessly migrate threads between cores, but the cores
> may be implemented differently. So logically they will be
> homogenous, but physically heterogenous. Which seems to
> be a fair trade-off.

It is possible that Windows (i.e. the lack of a good enough OS
scheduler) will force Intel to do this, but it should be nearly trivial
to either add a couple of requirement bits to executables and capability
bits to cores, or (with a better OS) let a core trap if asked to run an
unsupported opcode, migrate the task to a suitable core, and then use
this to dynamically update the task affinity mask.
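
The trap-and-migrate variant can be sketched in a few lines of Python. This is a toy simulation under loud assumptions, not a description of any real scheduler: the Core and Task classes, the feature names and run_with_migration are invented here, "opcodes" are just labels naming the feature they need, and a migrated task restarts from the beginning rather than resuming at the faulting instruction.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    core_id: int
    features: frozenset          # capability bits of this core (hypothetical names)


@dataclass
class Task:
    name: str
    opcodes: list                # each entry names the feature that opcode needs
    affinity: set = field(default_factory=set)  # core ids the task may run on


class UnsupportedOpcode(Exception):
    """Raised by a core asked to run an opcode it does not implement."""

    def __init__(self, feature):
        super().__init__(feature)
        self.feature = feature


def execute(core, task):
    for feature in task.opcodes:
        if feature not in core.features:
            raise UnsupportedOpcode(feature)   # the "trap"


def run_with_migration(task, cores):
    """Run the task, narrowing its affinity mask each time a core traps."""
    by_id = {c.core_id: c for c in cores}
    if not task.affinity:
        task.affinity = set(by_id)             # start out allowed everywhere
    while task.affinity:
        core = by_id[min(task.affinity)]       # simplistic choice of next core
        try:
            execute(core, task)
            return core.core_id
        except UnsupportedOpcode as trap:
            # Keep only cores that advertise the missing feature.
            task.affinity = {cid for cid in task.affinity
                             if trap.feature in by_id[cid].features}
    raise RuntimeError("no core supports this task")


cores = [Core(0, frozenset({"base"})),
         Core(1, frozenset({"base"})),
         Core(2, frozenset({"base", "vector"}))]
task = Task("demo", ["base", "vector"])
print(run_with_migration(task, cores), task.affinity)
```

After the first trap the task's mask shrinks to the one core with the vector feature, so later runs of the same task would go straight there.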

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Stephen Fuld

Aug 10, 2008, 2:12:59 PM

ChrisQ wrote:
> Jan Panteltje wrote:
>
>> Do not forget, since the days of 8088, and maybe CPUs running at about
>> 13 MHz, we now run at 3.4 GHz, 3400 / 13 = 261 x faster. Also even
>> faster because of better architectures. This leaves plenty of time for
>> a CPU to do normal IO. And in fact the IO has been hardware supported
>> always. For example, although you can poll a serial port bit by bit,
>> there is a hardware shift register, hardware FIFO too. Although you
>> can construct sectors for a floppy in software bit by bit, there is a
>> floppy controller with write pre-compensation etc.. all in hardware.
>> Although you could do graphics there is a graphics card with hardware
>> acceleration. The first 2 are included in the chip
>> set, maybe the graphics too. The same thing for Ethernet, it is a
>> dedicated chip, or included in the chip set, taking the place of your
>> 'IO processor'. Same thing for hard disks, and those may even have on
>> board encryption, all you have to do is specify a sector number and
>> send the sector data.
>>
>> So.. no real need for a separate IO processor, in fact you likely find
>> a processor in all that dedicated hardware, or maybe a FPGA.
>>
>>
>
> The op was right. If you task the main cpu with dealing
> with all the io, the cpu then ends up being interrupted for minor stuff
> like a keypress,

Several points. If you are dealing with keypresses, you are talking
human time scale interactions which are huge compared to other things,
so are not a major cause of performance hits. Also, it probably means
you are not dealing with a compute bound, long running program. Lastly,
the program must do something as a result of the keystroke, at least
update the display, and for some keys (e.g. Enter) probably a lot more,
so it is going to be interrupted anyway.

> disk block io completion,

If you mean one of multiple 512 byte blocks of a longer (e.g. 4K) disk
transfer, no modern disk interrupts on those anymore. If you mean the
completion of the say 4K transfer, then the main program probably wants
to be interrupted because it wants to do something with the data just
read in.

> network tx and rx completion

If you mean individual packets, then I agree completely, but if you mean
the whole message, then the same argument as the disk transfer above
applies.

> etc.
> Dma processes also typically steal cpu cycles during a data transfer.

No, they don't. They take memory cycles, but that is unavoidable, as
that is what they are designed to do after all. If the CPU is executing
out of cache, then it loses nothing due to an external DMA transfer.

There is merit in offloading as much of the "scut" work of IO as
possible, but a lot of it has already been done by smarter peripherals, etc.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Jan Panteltje

unread,

Aug 10, 2008, 2:15:16 PM8/10/08

to

On a sunny day (Sun, 10 Aug 2008 10:46:26 -0700) it happened AnimalMagic
<Anima...@petersbackyard.org> wrote in
<1t9u9411k8930thle...@4ax.com>:

> Where do you hook up the keyboard and monitor to your wireless LAN
>Router, running Linux at?

RS232 terminal dummy.
http://panteltje.com/panteltje/wap54g/index.html#wapserver

Actually, I have the RS232 disconnected, as all is working so well.
I use telnet to access the wap server, it is faster, and does
not interfere with normal operations.
It allows me to start and stop processes, get logfiles,
set wireless on and off, etc.
Screenshot of a simple telnet session to the wap server from another PC:
ftp://panteltje.com/pub/wap.gif

John Larkin

unread,

Aug 10, 2008, 3:09:49 PM8/10/08

to

On Fri, 08 Aug 2008 16:55:14 GMT, Jan Panteltje
<pNaonSt...@yahoo.com> wrote:

>On a sunny day (Fri, 08 Aug 2008 08:54:36 -0700) it happened John Larkin
><jjla...@highNOTlandTHIStechnologyPART.com> wrote in
><8v4m945fbcvrln66t...@4ax.com>:

>
>>>/One/ bottleneck is the cache-coherency system.
>>>
>>>
>>
>>I think the trend is to have the cores surround a common shared cache;
>>a little local memory (and cache, if the local memory is slower for
>>some reason) per CPU wouldn't hurt.
>>

>>Cache coherency is simple if you don't insist on flat-out maximum
>>performance. What we should insist on is flat-out unbreakable systems,
>>and buy better silicon to get the performance back if we need it.
>>

>>I'm reading Showstopper!, the story of the development of NT. It's a
>>great example of why we need a different way of thinking about OS's.
>>

>>Silicon is going to make that happen, finally free us of the tyranny
>>of CPU-as-precious-resource. A lot of programmers aren't going to like
>>this.
>>

>>John

>
>John Lennon:
>
>'You know I am a dreamer'
>....
>' And I hope you join us someday'
>
>(well what I remember of it).
>You should REALLY try to program a Cell processor some day.
>
>Dunno what you have against programmers, there are programmers who
>are amazingly clever with hardware resources.
>I dunno about NT and MS, but IIRC MS plucked programmers from
>unis, and sort of brainwashed them then.. the result we all know.

That's not what happened. They hired David Cutler from DEC, where he
had worked on VMS, and pretty much left him alone. The chaos was and
is part of the culture of modern programming.

John

John Larkin

unread,

Aug 10, 2008, 3:22:17 PM8/10/08

to

On Sun, 10 Aug 2008 17:48:42 GMT, Jan Panteltje
<pNaonSt...@yahoo.com> wrote:

Small programs written by one person can be perfect, and often are. So
why not have the OS kernel be small, and written by one person?

Why waste X-prizes on solar cars and suborbital tourism? How about...

$10 million for a firm specification of a multiple-CPU OS architecture
based on a nanokernel design.

Another $10M for public-domain code that implements that kernel.

A final $10M for a working OS using the above.

For a mere $30 million, about 1/5 of what the first release of NT
cost, we could change the world.

(I think Vista is an attempt at a smaller kernel, but it pays a big
price in overhead.)

>
>There is a lot of good software, I would say that software that does what
>it is intended to do, and does that without crashing, is good software.
>If that software runs on good hardware you can do a lot with it.
>All the problems with MS operating system are alien to me, the last MS OS I bought was win98SE, I
>still have it on a PC, and it does occasionally misbehave, use it
>for my Canon scanner, and DVD layout sometimes.
>I will not go online with it.....
>All other things run various versions / distributions of Linux, think
>I have tried most of these, all but RatHead worked OK.
>
>So I do not really see your problem, things do not crash,
>the soft I wrote myself does not crash,
>things do not get infected with trojans, viruses, worms, other things...
>I have a very good firewall (iptables), latest DNS fixes, this server has now
>been running since 2004, still with the same Seagate harddisk...
>
>What is your problem?
>As to computer languages, the portability of C will help you out big time
>once you want to run that same stable application on say a MIPS platform,
>or any other processor.
>Re-writing your code in ASM for each new platform is asking for bugs,
>so C is a universal solution.

Changing "platforms" in an embedded system is such a hassle that the
code is a fraction of the effort. It's rarely done.

And C is not portable in embedded systems. Assuming it is is begging
for bugs.

>Especially for more complex programs.

Don't do that.

>AND operating systems.

Don't do that, either.

John

Bill Todd

unread,

Aug 10, 2008, 3:25:00 PM8/10/08

to

Jan Panteltje wrote:
> On a sunny day (Sun, 10 Aug 2008 05:58:13 -0400) it happened Bill Todd
> <bill...@metrocast.net> wrote in
> <1aqdnfjG5tCEJgPV...@metrocastcablevision.com>:
>
>>> Just to rain a bit on your parade, in the *Linux* kernel,
>>> many years ago, the concept of 'modules' was introduced.
>>> Now device drivers are 'modules', and are, although closely connected, and in the same
>>> source package, _not_ a real part of the kernel.
>>> (I am no Linux kernel expert, but it is absolutely possible to write a device
>>> driver as module, and then, while the system is running, load that module,
>>> and unload it again.)
>>> I sort of have the feeling that your knowledge of Linux, and the Linux kernel, is very academic John,
>>> and you should really compile a kernel and play with Linux a bit to get
>>> the feel of it.
>> Er, the discussion that John quoted above referred not to what is
>> compiled with the kernel but to what executes in the same protection
>> domain that the kernel does (as it is my impression Linux modules do).
>> Perhaps John is not the one who needs to develop a deeper understanding
>> here.
>
> He mentioned 'monolithic', and with modules, the Linux kernel is _not_ monolithic.

I've snipped the rest of your drivel, Jan, because the above says it all.

Before you make even more of an ass of yourself, why not actually do a
Google search on 'monolithic kernel' to get some idea of how it's
actually defined by those who know what they're talking about?

- bill

Jan Panteltje

unread,

Aug 10, 2008, 3:44:44 PM8/10/08

to

On a sunny day (Sun, 10 Aug 2008 15:25:00 -0400) it happened Bill Todd
<bill...@metrocast.net> wrote in
<xf2dna_cPN5togLV...@metrocastcablevision.com>:

So I did, and the thing is not so sharply bounded as it may seem.
You can have a device driver in user space in Linux too, I have done that too.
I am referring to
http://upload.wikimedia.org/wikipedia/commons/d/d0/OS-structure2.svg
I agree my definition of monolithic is slightly different from yours and Wikipedia's.
If you know what you are talking about I dunno, maybe you do.

Jan Panteltje

unread,

Aug 10, 2008, 3:52:42 PM8/10/08

to

On a sunny day (Sun, 10 Aug 2008 12:22:17 -0700) it happened John Larkin
<jjla...@highNOTlandTHIStechnologyPART.com> wrote in
<gdfu94p2jcn004o6m...@4ax.com>:

>>Stating that more cores will improve _reliability_
>>(in the widest sense of the word) as you seem to
>>(at least that is what I understand from your postings),
>>puts the burden of proof on you.
>>
>>You call software bad, yet you claim your own small asm programs are perfect,
>>this makes one suspicious.
>
>Small programs written by one person can be perfect, and often are. So
>why not have the OS kernel be small, and written by one person?

QNX has just made their sources public; you still need a license for commercial use, though.
Are you thinking that way?
In the eighties I worked with somebody who really liked QNX; I was into Unix.
Unix in the form of Linux solves a lot of programming problems, but it brought a real-time
problem: task switching broke some MSDOS-like apps.
So in a way it required more hardware.

>Why waste X-prizes on solar cars and suborbital tourism? How about...
>
>$10 million for a firm specification of a multiple-CPU OS architecture
>based on a nanokernel design.
>
>Another $10M for public-domain code that implements that kernel.
>
>A final $10M for a working OS using the above.
>
>For a mere $30 million, about 1/5 of what the first release of NT
>cost, we could change the world.
>
>(I think Vista is an attempt at a smaller kernel, but it pays a big
>price in overhead.)

Why spend all those millions? We _have_ Linux, it works, and - it is mostly
written in C -, and because of that relatively easily portable to many platforms.

Bill Todd

unread,

Aug 10, 2008, 4:19:19 PM8/10/08

to

While there is a minor amount of boundary fuzzing at the edges (for
example, NT originally claimed to be microkernel- or at least 'hybrid
kernel'-based due to its use of separate processes to handle security
and the 'personalities'), NT and its descendants are basically still
monolithic kernels (and Linux doesn't even have that fig-leaf to hide
behind: it's monolithic, period).

> You can have a device driver in user space in Linux too, I have done that too.

That is not, however, what you were talking about: you were talking
about Linux kernel modules, which are part of the monolithic Linux
kernel despite being nicely modularized and loadable on demand (not that
Linux is anything special in having loadable-on-demand drivers, you
understand: for example, dynamically-loadable drivers were designed
into NT's first release in 1993, and the kernel proper - while
monolithic - was definitely modular).

> I am referring to
> http://upload.wikimedia.org/wikipedia/commons/d/d0/OS-structure2.svg

That diagram clearly defines the difference between monolithic kernels
and microkernels to be based on where the protection domain boundaries
fall, not modularity.

- bill

Kim Enkovaara

unread,

Aug 11, 2008, 3:11:23 AM8/11/08

to

John Larkin wrote:
> than the older ones. In products with hardware, HDL-based logic, and
> firmware, it's nearly always the firmware that's full of bugs. If
> engineers can write bug-free VHDL, which they usually do, why can't
> programmers write bug-free C, which they practically never do?

There is no such thing as bug-free HDL. The bug density is just usually
lower, especially in ASICs. The main reason for that is more thorough
testing of the code, because respins of the chips are slow and expensive.
In FPGAs, where you can always do an update, the bug densities are much
higher in the beginning.

--Kim

Kim Enkovaara

unread,

Aug 11, 2008, 3:20:33 AM8/11/08

to

John Larkin wrote:
>> Re-writing your code in ASM for each new platform is asking for bugs,
>> so C is a universal solution.
>
> Changing "platforms" in an embedded system is such a hassle that the
> code is a fraction of the effort. It's rarely done.

At least at the high end of embedded systems, processor updates and
model changes are quite frequent. The lifetimes of processors
and their peripherals (especially DRAM memories) are becoming shorter
all the time. The code has to be portable and easily adaptable to
different platforms.

> And C is not portable in embedded systems. Assuming it is is begging
> for bugs.

C is very portable in embedded systems as far as I have seen. Some very
minimal processors have weird compilers, but the bigger processors
usually have gcc support, and the commercial compilers also support
C the same way as gcc.

>> Especially for more complex programs.
>
> Don't do that.

High-end embedded systems can easily contain 10Mloc of code, and that
amount is needed to support all the required features.

--Kim

Guy Macon

unread,

Aug 11, 2008, 5:09:11 AM8/11/08

to

Kim Enkovaara wrote:
>
>John Larkin wrote:
>
>>> Re-writing your code in ASM for each new platform is asking for bugs,
>>> so C is a universal solution.
>>
>> Changing "platforms" in an embedded system is such a hassle that the
>> code is a fraction of the effort. It's rarely done.
>
>At least at the high end of embedded systems, processor updates and
>model changes are quite frequent. The lifetimes of processors
>and their peripherals (especially DRAM memories) are becoming shorter
>all the time. The code has to be portable and easily adaptable to
>different platforms.

And at the very low end, changes to completely different processors
are also very common. If someone comes up with a micro that costs
8.4 cents and replaces a part that costs 8.5 cents, that's a saving
of $16,800 per week at a production rate of 100,000 units per hour.
After a while you get the attitude of "ho hum, another assembly
language instruction set."
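
The weekly figure above can be checked with a few lines of exact
arithmetic. This is only a sketch: it assumes the production line runs
around the clock (168 hours per week), which the post does not state but
which is the only reading that reproduces $16,800.

```python
from fractions import Fraction

# Prices in dollars: 8.5 cents for the old part, 8.4 cents for the new one.
saving_per_unit = Fraction(85, 1000) - Fraction(84, 1000)

units_per_hour = 100_000
hours_per_week = 24 * 7  # assumption: continuous production

weekly_saving = saving_per_unit * units_per_hour * hours_per_week
print(weekly_saving)  # 16800
```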

As for "asking for bugs", I find that working with masked rom
parts with a big setup fee and a minimum order of 10,000 parts
clarifies the mind quite nicely.

--
Guy Macon
<http://www.GuyMacon.com/>

Piotr Wyderski

unread,

Aug 11, 2008, 7:51:37 AM8/11/08

to

Terje Mathisen wrote:

> It is possible that Windows (i.e. the lack of a good enough OS scheduler)
> will force Intel to do this, but it should be nearly trivial to either add
> a couple of requirement bits to executables and capability bits to cores,
> or (with a better OS) let a core trap if asked to run an unsupported
> opcode, migrate the task to a suitable core, and then use this to
> dynamically update the task affinity mask.

It's not that simple. The trap may be caused by any part of the process,
among other things a dynamically loaded plugin or a shared library. This
way the OS may pretty soon end up with most of its processes migrated to
the most capable CPU, and the other cores will go to sleep. On a set of
homogeneous (but not equally fast, i.e. complex) cores there is no
possibility of such a contention. Moreover, the OS can monitor a process's
performance counters in order to send the most CPU-bound threads to a
faster core and the IO-bound ones to a slower one.

It shouldn't be that hard for Intel to build such a chip: they already have
Core2-class cores and can design the simpler ones with relaxed performance
expectations: in-order, narrow data paths, microcoding everything more
complex than addition etc.

Best regards
Piotr Wyderski

Rob Warnock

unread,

Aug 11, 2008, 8:22:27 AM8/11/08

to

Guy Macon <http://www.GuyMacon.com/> wrote:
+---------------

| As for "asking for bugs", I find that working with masked rom
| parts with a big setup fee and a minimum order of 10,000 parts
| clarifies the mind quite nicely.

+---------------

Yup! In the early days of DCA[1], the EEPROMs we used in our remote
concentrators were still *very* expensive, and there were several times
where if we'd had to send out replacements to all the units in the
field[2] we simply wouldn't have been able to make the next payroll!! ;-}
So we *had* to make very, very sure that we didn't have any bugs.
And, yes, we were writing exclusively in PDP-8 & Z-80 assembler.[3]

-Rob

[1] Digital Communications Associates in Atlanta, not the ".gov" one.

[2] You have to send out the new ones before anyone will stop using
the production equipment long enough to send you back the old
ones for re-programming!

[3] Well, we actually wrote BLISS-like pseudo-code (which got left
in the comments) and then "hand-compiled" it to assembler.
But both the BLISS pseudo-code and the assembler got line-by-line
reviews by multiple people. Expensive, but quite reliable.

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Terje Mathisen

unread,

Aug 11, 2008, 8:57:28 AM8/11/08

to

Piotr Wyderski wrote:
> Terje Mathisen wrote:
>
>> It is possible that Windows (i.e. the lack of a good enough OS
>> scheduler) will force Intel to do this, but it should be nearly
>> trivial to either add a couple of requirement bits to executables and
>> capability bits to cores, or (with a better OS) let a core trap if
>> asked to run an unsupported opcode, migrate the task to a suitable
>> core, and then use this to dynamically update the task affinity mask.
>
> It's not that simple. The trap may be caused by any part of the process,
> among other things a dynamically loaded plugin or a shared library. This
> way the OS may pretty soon end up with most of its processes migrated
> to the most capable CPU and the other cores will go to sleep. On a
> set of homogeneous (but not equally fast, i.e. complex) cores there is
> no possibility of such a contention.

That's valid, except for the fact that it would be quite trivial to
determine that the trapping opcode was one of the 16-wide LRB instructions.

As long as each LRB-compiled task would issue a register-only
instruction, like a 16-wide xor to zero out a register, any trap could
only happen due to this opcode, and not a page fault or other spurious
interrupt.
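
The trap-and-migrate scheme can be modelled in a few lines. This is a toy
simulation, not real scheduler code: the opcode name "vxor16", the
UnsupportedOpcode exception standing in for the hardware trap, and the
four-core layout are all hypothetical.

```python
class UnsupportedOpcode(Exception):
    """Stands in for the hardware trap on an unsupported instruction."""


WIDE_OPS = {"vxor16"}  # hypothetical name for a 16-wide register-only xor


def execute(op, core_has_wide):
    """Pretend to execute one instruction; trap if the core lacks support."""
    if op in WIDE_OPS and not core_has_wide:
        raise UnsupportedOpcode(op)


def run_task(ops, core_has_wide, start_core=0):
    """Run `ops`, migrating on a trap; return (final core, affinity mask)."""
    all_cores = (1 << len(core_has_wide)) - 1
    capable = sum(1 << i for i, wide in enumerate(core_has_wide) if wide)
    core, affinity = start_core, all_cores
    for op in ops:
        try:
            execute(op, core_has_wide[core])
        except UnsupportedOpcode:
            if capable == 0:
                raise  # no core in the system can run this task
            affinity &= capable  # never schedule on incapable cores again
            core = (affinity & -affinity).bit_length() - 1  # lowest allowed core
            execute(op, core_has_wide[core])  # retry the faulting instruction
    return core, affinity


cores = [False, False, True, True]  # assumed: only cores 2 and 3 have wide ops

migrated = run_task(["vxor16", "add"], cores)  # probe opcode comes first
stayed = run_task(["add", "mov"], cores)       # never touches wide ops
```

Placing the register-only probe first means the task traps at most once,
at startup, before it has built up any cache state worth preserving.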

OTOH, if a system-wide support library would start to use LRB
instructions, then you are of course correct that all processes would
indeed end up on the most capable subset, but I really don't see the
need to do this?

> Moreover,
> the OS can monitor process' performance counters in order to send the most
> CPU-bound threads to a faster core and the IO-bound ones to a slower one.
>
> It shouldn't be that hard for Intel to build such a chip: they already have
> Core2-class cores and can design the simpler ones with relaxed performance
> expectations: in-order, narrow data paths, microcoding everything more
> complex than addition etc.

Might be useful for a laptop chip?

Piotr Wyderski

unread,

Aug 11, 2008, 10:28:21 AM8/11/08

to

Terje Mathisen wrote:

> OTOH, if a system-wide support library would start to use LRB
> instructions, then you are of course correct that all processes would
> indeed end up on the most capable subset, but I really don't see the need
> to do this?

I didn't mean necessity, just possibility. :-)
Such a system will be at an unstable equilibrium, so just a single event
in shared code would cause (almost) global transition into the most
capable subset (a kind of avalanche effect). It could be a potential
security threat resulting in a DoS-like attack. The mixed architecture
with equally capable cores is, at least in this respect, fool-proof by
design.

> Might be useful for a laptop chip?

Well, IMHO not only. There are many IO-bound processes both in laptops
and (especially) in servers, so they can be executed by a core which is just
as fast as necessary, but not faster. So they might be assigned to a simple
core from the set. If such a process eventually becomes CPU-bound, then
the OS scheduler will dynamically re-assign it to a faster core. And then back
to a slower one, if there is no further demand for the cycles.
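
A rough sketch of such a placement policy, assuming the scheduler can
derive a per-interval CPU utilisation figure from the performance
counters. The function name place_thread and the 80% / 30% thresholds
are made up for illustration; the gap between the two thresholds is
there so a thread does not bounce between core classes on every sample.

```python
def place_thread(cpu_fraction, on_fast_core, promote_at=0.80, demote_at=0.30):
    """Decide whether a thread should run on a fast core next interval.

    cpu_fraction: share of the last interval spent executing (0.0 to 1.0).
    on_fast_core: True if the thread currently runs on a fast core.
    """
    if not on_fast_core and cpu_fraction >= promote_at:
        return True   # became CPU-bound: move to a fast core
    if on_fast_core and cpu_fraction <= demote_at:
        return False  # mostly waiting on IO again: move back to a slow core
    return on_fast_core  # in between: stay where it is


# A thread that idles, bursts, and then goes quiet again.
samples = [0.05, 0.10, 0.90, 0.60, 0.20]
placement = []
on_fast = False
for sample in samples:
    on_fast = place_thread(sample, on_fast)
    placement.append(on_fast)
```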

Best regards
Piotr Wyderski