
www.sicortex.com


Nick Maclaren

unread,
Apr 2, 2008, 1:25:02 PM4/2/08
to

It looks interesting, but will it succeed? Any serious or even
humorous comments on it appreciated.


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Apr 2, 2008, 4:33:50 PM4/2/08
to
Nick Maclaren wrote:
> It looks interesting, but will it succeed? Any serious or even
> humorous comments on it appreciated.

Interesting indeed.

5.8 TFlops is just at the low end of the useful range. I didn't see any
mention of how you'd gang multiple boxes together, but I assume that
they'd like to sell them, so they must have considered this.

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Nick Maclaren

unread,
Apr 2, 2008, 4:52:51 PM4/2/08
to

In article <-KCdnWvYdIczcG7a...@giganews.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|>
|> > It looks interesting, but will it succeed? Any serious or even
|> > humorous comments on it appreciated.
|>
|> Interesting indeed.
|>
|> 5.8 TFlops is just in the low end of the useful range, I didn't see any
|> mention of how you'd gang multiple boxes together, but I assume that
|> they'd like to sell them so they must have considered this.

For top-end HPC, yes. But consider alternative uses. The very
small box should make an excellent development engine for 'serious'
supercomputers (and is my personal interest in it). The medium
one should fit into an office or laboratory without too much hassle.

Also, the great unknown is whether you can get a higher proportion
of its theoretical peak out of it than you can on dual-socket,
quad-core Intel systems. That is the key to whether it is a good
buy or a bad one - plus its price, of course.


Regards,
Nick Maclaren.

Chris Thomasson

unread,
Apr 2, 2008, 8:24:14 PM4/2/08
to
"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:ft0fhe$7qg$1...@gemini.csx.cam.ac.uk...

>
> It looks interesting, but will it succeed? Any serious or even
> humorous comments on it appreciated.

Very cool! Do you happen to know if they provide an ISA manual and an
assembler? I assume I can use POSIX and C because it runs Linux. I am
interested in the semantics of the assembly instructions that drive their
DMA engines. I think it would be neat to be able to design custom
message-passing frameworks for this beast.

Terje Mathisen

unread,
Apr 3, 2008, 3:49:42 AM4/3/08
to

Price is the really important consideration:

Currently StatoilHydro has around 20 TF (AFAIR) in its seismic clusters,
and the machine rooms have been through the refit needed to handle the
cooling issue.

In an upgrade, each of the CPU nodes would get upgraded to roughly
double the performance, which would more or less keep up with
increased needs.

If SiCortex can deliver the same performance at half or lower cost, then
it might be interesting, particularly if it would at the same time allow
scaling to another order of magnitude without power/cooling problems.

Nick Maclaren

unread,
Apr 3, 2008, 4:01:11 AM4/3/08
to

In article <DuqdnavI0b-KEWna...@giganews.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|>
|> > Also, the great unknown is whether you can get a higher proportion
|> > of its theoretical peak out of it than you can on dual-socket,
|> > quad-core Intel systems. That is the key to whether it is a good
|> > buy or a bad one - plus its price, of course.
|>
|> Price is the really important consideration:

Agreed.

|> Currently StatoilHydro has around 20 TF (afair) in its seismic clusters,
|> and the machine rooms have been through the needed refit to handle the
|> cooling issue.

Not everyone has that option. I was one of the first people to hit
this hard, but a lot of sites are limited by cooling and/or power or
space. They quite simply CAN'T upgrade for less than a cost that would
make any plausible computers look cheap. For example, some of the
finance houses located in central London, New York etc. are like that;
and their available budget makes salesmen drool :-)

So there is still a significant market even without the price factor.
But I agree that performance/Euro will sell more than performance/watt,
performance/sq.m. or performance/Kg.


Regards,
Nick Maclaren.

Paul A. Clayton

unread,
Apr 3, 2008, 11:18:02 AM4/3/08
to
On Apr 2, 8:24 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Nick Maclaren" <n...@cus.cam.ac.uk> wrote in message

The processors themselves are MIPS64, so a trip to www.mips.com
(last time I tried, one had to get an account and log in; zero financial
cost, modest bother) can get you the ISA information.

According to the limited documentation I have read, the DMA is
not exposed; they suggest the use of their MPI implementation.

Hope that was helpful.

Paul A. Clayton
reachable as 'paaronclayton'
at "embarqmail.com"

Tim McCaffrey

unread,
Apr 3, 2008, 11:41:25 AM4/3/08
to
In article <ft0fhe$7qg$1...@gemini.csx.cam.ac.uk>, nm...@cus.cam.ac.uk says...

>
>
>It looks interesting, but will it succeed? Any serious or even
>humorous comments on it appreciated.
>
Considering the location, is this where one of the lost tribes of DEC ended
up?

- Tim

Nick Maclaren

unread,
Apr 3, 2008, 11:50:41 AM4/3/08
to

In article <ft2tr5$8q9$1...@USTR-NEWS.TR.UNISYS.COM>,

timca...@aol.com (Tim McCaffrey) writes:
|> >
|> >It looks interesting, but will it succeed? Any serious or even
|> >humorous comments on it appreciated.
|> >
|> Considering the location, is this where one of the lost tribes of DEC ended
|> up?

Apparently Apollo, but they have travelled through DEC first,
and may have picked up some people on the way :-)


Regards,
Nick Maclaren.

Tim McCaffrey

unread,
Apr 3, 2008, 11:51:48 AM4/3/08
to
In article <ft2tr5$8q9$1...@USTR-NEWS.TR.UNISYS.COM>, timca...@aol.com says...
And to follow up, Google Earth shows they are in or right next to The Mill.

- Tim

David Kanter

unread,
Apr 3, 2008, 1:09:24 PM4/3/08
to
On Apr 3, 1:01 am, n...@cus.cam.ac.uk (Nick Maclaren) wrote:

Yup. The interesting question is the performance of SiCortex relative
to a heavily power-optimized x86 system. For instance, using 50W
2.5GHz CPUs instead of 3.2GHz 120W CPUs, and using DDR2 instead of
FB-DIMMs.

Usually an appropriate x86 box (one that exists) beats most of these
non-traditional solutions or comes near enough that stuff like
SiCortex never makes sense in the long run (i.e. there may be a brief
window where an alternative solution is optimal, but rarely for an
extended period of time).

DK

Greg Lindahl

unread,
Apr 3, 2008, 1:18:51 PM4/3/08
to
In article <ft2uek$987$1...@USTR-NEWS.TR.UNISYS.COM>,
Tim McCaffrey <timca...@aol.com> wrote:

>And to followup, Google Earth shows they are in or right next to The Mill.

They are in the Mill; however, they are the lost tribe of Thinking
Machines. Some DEC people, yes.

These are the same guys who bought the PathScale compiler group.

-- greg


Nick Maclaren

unread,
Apr 3, 2008, 1:32:14 PM4/3/08
to

In article <47f5117b$1...@news.meer.net>, lin...@pbm.com (Greg Lindahl) writes:
|> In article <ft2uek$987$1...@USTR-NEWS.TR.UNISYS.COM>,
|> Tim McCaffrey <timca...@aol.com> wrote:
|>
|> >And to followup, Google Earth shows they are in or right next to The Mill.
|>
|> They are in the Mill, however, they are the lost tribe of Thinking
|> Machines. Some DEC people, yes.

Ah! Them. That figures - it has their style.


Regards,
Nick Maclaren.

Paul Gotch

unread,
Apr 3, 2008, 2:02:59 PM4/3/08
to
David Kanter <dka...@gmail.com> wrote:
> Yup. The interesting question is the performance of SiCortex relative
> to a heavily power optimized x86 system. For instance, using 50W
> 2.5GHz cpus instead of 3.2GHz 120W cpus and using DDR2 instead of FB-
> DIMMs.

They are talking about 600 mW 500 MHz processors. They say they've licensed
the MIPS64 architecture; however, I wouldn't be at all surprised if the
processor were actually a MIPS 20Kc or a tweak thereof, since they claim
they've taken the approach of licensing reusable IP to build the machine
with a small team.

This is the same approach that IBM has taken, albeit in a much more
specialised way, with the BlueGene project.

> Usually an appropriate x86 box (one that exists) beats most of these
> non-traditional solutions or comes near enough that stuff like
> SiCortex never makes sense in the long run (i.e. there may be a brief
> window where an alternative solution is optimal, but rarely for an
> extended period of time).

I remain unconvinced, although we'll see if Intel can come out with a 64-bit
x86 SoC processor with a worst-case power of < 1 W at between 500 MHz and
1 GHz that can deliver the same or greater number of FLOPS.

-p
--
"Unix is user friendly, it's just picky about who its friends are."
- Anonymous
--------------------------------------------------------------------

Jeff Kenton

unread,
Apr 3, 2008, 6:55:28 PM4/3/08
to

At least some of them came from the BBN Butterfly group. Don't know
about any DECies, but they are in the old DEC mill building.


--

---------------------------------------------------------------------
= Jeff Kenton http://home.comcast.net/~jeffrey.kenton =
---------------------------------------------------------------------

Chris Thomasson

unread,
Apr 4, 2008, 4:53:54 AM4/4/08
to
"Paul A. Clayton" <paaron...@earthlink.net> wrote in message
news:5cb22757-ec55-4c1e...@c65g2000hsa.googlegroups.com...

> On Apr 2, 8:24 pm, "Chris Thomasson" <cris...@comcast.net> wrote:
>> "Nick Maclaren" <n...@cus.cam.ac.uk> wrote in message
>>
>> news:ft0fhe$7qg$1...@gemini.csx.cam.ac.uk...
>>
>>
>>
>> > It looks interesting, but will it succeed? Any serious or even
>> > humorous comments on it appreciated.
>>
>> Very cool! Do you happen to know if they provide an ISA manual and an
>> assembler? I assume I can use POSIX and C because it runs Linux. I am
>> interested in the semantics of the assembly instructions that drive their
>> DMA engines.. I think it would be neat to be able to design custom
>> message-passing frameworks for this beast.
>
> The processors themselves are MIPS64, so a trip to www.mips.com
> (last time I tried, one had to get an account and log in, zero
> financial
> cost, modest bother) can get you the ISA information.

Ahh, thanks. I can definitely program with that ISA.


> According to the limited documentation I have read, the DMA is
> not exposed; they suggest the use of their MPI implementation.

IMVHO, if it's indeed true that there are no instructions that drive their
DMA engine, well, that's total crap! I would like the flexibility to program
my own custom message-passing. Well, if somebody who posts to the group has
access to one of these systems, would you kindly send a disassembly of the
'MPI_Send/Rcv' functions? There have to be some special instructions which
trigger DMA events. Anyway, I did some more reading and found where they
explicitly say that:

"For programs that use multithreading facilities such as pthreads or openMP,
each SiCortex node is a cache-coherent SMP."

So it looks like I could use the MIPS64 instruction set to implement my
existing libraries which make heavy use of shared-memory non-blocking
algorithms. Humm... I wonder if I could use one of my nearly zero-overhead
atomic queue algorithms for message-passing instead of their DMA engines...
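
A tiny illustration of the kind of intra-node primitive that a cache-coherent
SMP node allows -- a lock-free stack push in C, using GCC's __sync
compare-and-swap builtin, which compiles to an LL/SC loop on MIPS64. This is
a generic sketch, not SiCortex code, and it ignores the ABA and
memory-reclamation issues a production queue has to handle:

struct node { struct node *next; /* payload fields go here */ };

static struct node *top;                       /* shared stack head */

void push(struct node *n)
{
    struct node *old;
    do {
        old = top;                             /* snapshot the current head */
        n->next = old;
    } while (__sync_val_compare_and_swap(&top, old, n) != old);
}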


> Hope that was helpful.

It was helpful indeed.

Paul A. Clayton

unread,
Apr 4, 2008, 10:43:37 AM4/4/08
to
On Apr 4, 4:53 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Paul A. Clayton" <paaronclay...@earthlink.net> wrote in messagenews:5cb22757-ec55-4c1e...@c65g2000hsa.googlegroups.com...
[snip]

> > According to the limited documentation I have read, the DMA is
> > not exposed; they suggest the use of their MPI implementation.
>
> IMVHO, if its indeed true that there are no instructions that drive their
> DMA engine, well, that's total crap! I would like the flexibility program my
> own custom message-passing. Well, if somebody that posts to the group has
> access to one of these systems, would you kindly send a disassembly of the
> 'MPI_Send/Rcv' functions? There has to be some special instructions which
> trigger DMA events. Anyway, I did some more reading and found where they
> explicitly say that:

I did not mean to imply that they hide (as a trade secret) the DMA
interface, merely that they do not give any documentation (linked
from their website) to indicate how such would be used.

> "For programs that use multithreading facilities such as pthreads or openMP,
> each SiCortex node is a cache-coherent SMP."
>
> So it looks like I could use the MIPS64 instruction set to implement my
> existing libraries which make heavy use of shared-memory non-blocking
> algorithms. Humm... I wonder if I could use one of my nearly zero-overhead
> atomic queue algorithms for message-passing instead of their DMA engines...

Well, a node is only one processor chip (6 processors) and 2 DDR2 DIMMs,
so even the Catapult (desk-side) unit has 12 nodes.

> > Hope that was helpful.
>
> It was helpful indeed.

I am pleased.


Paul A. Clayton
just a technophile

Chris Thomasson

unread,
Apr 4, 2008, 11:12:09 AM4/4/08
to

From: "Paul A. Clayton" <paaron...@earthlink.net>
Subject: Re: www.sicortex.com
Date: Friday, April 04, 2008 7:43 AM

On Apr 4, 4:53 am, "Chris Thomasson" <cris...@comcast.net> wrote:
> "Paul A. Clayton" <paaronclay...@earthlink.net> wrote in
> messagenews:5cb22757-ec55-4c1e...@c65g2000hsa.googlegroups.com...
[snip]
> > > According to the limited documentation I have read, the DMA is
> > > not exposed; they suggest the use of their MPI implementation.
> >
> > IMVHO, if its indeed true that there are no instructions that drive
> > their
> > DMA engine, well, that's total crap! I would like the flexibility
> > program my
> > own custom message-passing. Well, if somebody that posts to the group
> > has
> > access to one of these systems, would you kindly send a disassembly of
> > the
> > 'MPI_Send/Rcv' functions? There has to be some special instructions
> > which
> > trigger DMA events. Anyway, I did some more reading and found where they
> > explicitly say that:

> I did not mean to imply that they hide (as a trade secret) the DMA
> interface, merely that they do not give any documentation (linked
> from their website) to indicate how such would be used.

Yeah. I bet that they don't hide them at all.


>> "For programs that use multithreading facilities such as pthreads or
>> openMP,
>> each SiCortex node is a cache-coherent SMP."
>>
>> So it looks like I could use the MIPS64 instruction set to implement my
>> existing libraries which make heavy use of shared-memory non-blocking
>> algorithms. Humm... I wonder if I could use one of my nearly
>> zero-overhead
>> atomic queue algorithms for message-passing instead of their DMA
>> engines...

> Well, a node is only one processor chip (6 processors) and 2 DDR2
> DIMMs,
> so even the Catapult (desk-side) unit has 12 nodes.

Right. I was thinking that I could use existing shared-memory
multi-threading techniques for intra-node programming, and use their MPI
interface only for inter-node communication. I don't think I would use MPI
to communicate between processors on the same node. I would rather use the
node's "local" shared memory for that purpose...

For the Catapult unit, I could use 12 multi-threaded processes where each
process has its execution affinity bound to a separate node. That way the
threads of each process would be running on the CPUs belonging to the
process's node. The threads within a process can use local shared memory to
communicate. The processes would use the MPI interface to communicate. That
simple scheme should work fine on these neat systems...
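
A minimal sketch of that layout, assuming one MPI rank per node and pthreads
within the rank. The 6-thread constant and the launch details are assumptions
for illustration, not SiCortex specifics -- you would start it with one rank
per node (e.g. 12 ranks on a Catapult), using whatever flags your MPI
launcher provides for that:

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define THREADS_PER_NODE 6              /* cores on one SiCortex node */

static double partial[THREADS_PER_NODE];

static void *worker(void *arg)
{
    long id = (long)arg;
    partial[id] = (double)id;           /* stand-in for real per-core work */
    return NULL;
}

int main(int argc, char **argv)
{
    int rank, nranks, provided;

    /* FUNNELED: only the thread that called MPI_Init_thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    pthread_t tid[THREADS_PER_NODE];
    for (long i = 0; i < THREADS_PER_NODE; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    double node_sum = 0.0, total = 0.0;
    for (long i = 0; i < THREADS_PER_NODE; i++) {
        pthread_join(tid[i], NULL);
        node_sum += partial[i];         /* intra-node: plain shared memory */
    }

    /* inter-node: MPI only, from the main thread */
    MPI_Reduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum over %d nodes = %g\n", nranks, total);

    MPI_Finalize();
    return 0;
}

MPI_THREAD_FUNNELED matches the scheme described above: the worker threads
touch only shared memory, and the main thread is the only one that calls MPI.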

matt....@sicortex.com

unread,
Apr 6, 2008, 12:49:17 PM4/6/08
to
I'll try to catch up on a few of the questions and conjectures:

1. The SiCortex founders (Jud Leonard, John Mucci, and I) had worked before
at Digital (all three), Thinking Machines (John), Symbolics (Jud) and other
places. We've since added folks from a lot of different places. (But I can't
think of any Apollo alums, though I don't have the whole list of folks at my
fingertips.)

2. The DMA engine microcode is published. The brave and willing can
certainly use it as a starting place for new approaches.

3. The systems are designed with a focus on reliability, low power,
deployability, and good price performance for applications that scale to
hundreds or thousands of processors. (We also fit a few other profiles that
go beyond this focus, but this is what we were thinking when we designed and
built the product.)

4. We implement MPI because that is what the major part of our target
markets requires. We do pretty well at it too. The MPI implementation (based
on MPICH) talks to the DMA engine through a work-queue based interface. The
DMA engine microcode also supports IP over the SiCortex fabric and a
high-performance communication path to support the Lustre parallel file
system.

already...@yahoo.com

unread,
Apr 6, 2008, 1:19:25 PM4/6/08
to
Matt,
Could you comment on Paul Gotch's speculations above (about the MIPS
20Kc)?

Nick Maclaren

unread,
Apr 6, 2008, 2:17:40 PM4/6/08
to

In article <aa530dfa-c87f-4569...@k37g2000hsf.googlegroups.com>,

matt....@sicortex.com writes:
|>
|> We've since added folks from a lot of different places. (But I can't
|> think of any
|> Apollo alums, though I don't have the whole list of folks at my
|> fingertips.)

Ah. My source got it wrong, then.


Regards,
Nick Maclaren.

Chris Thomasson

unread,
Apr 9, 2008, 12:20:57 AM4/9/08
to
<matt....@sicortex.com> wrote in message
news:aa530dfa-c87f-4569...@k37g2000hsf.googlegroups.com...

> I'll try to catch up on a few of the questions and conjectures:
>
> 1. The SiCortex founders (Jud Leonard, John Mucci, and I) had worked
> before at
> Digital (all three), Thinking Machines (John), Symbolics (Jud) and
> other places.
> We've since added folks from a lot of different places. (But I can't
> think of any
> Apollo alums, though I don't have the whole list of folks at my
> fingertips.)

A flood of talented people simply cannot hurt anyone!

:^D


> 2. The DMA engine microcode is published. The brave and willing can
> certainly
> use it as a starting place for new approaches.

Perfect.


> 3. The systems are designed with a focus on reliability, low power,
> deployability,
> and good price performance for applications that scale to hundreds or
> thousands
> of processors. (We also fit a few other profiles that go beyond this
> focus, but this
> is what we were thinking when we designed and built the product.)

IMHO, your systems seem to make a lot of sense.


> 4. We implement MPI because that is what the major part of our target
> markets
> requires. We do pretty well at it too. The MPI implementation (based
> on MPICH)
> talks to the DMA engine through a work-queue based interface. The DMA
> engine
> microcode also supports IP over the SiCortex fabric and a high
> performance
> communication path to support the Lustre parallel file system.

What do you think about my initial idea on how to program your systems? That
is, using advanced shared-memory multi-threading techniques for intra-node
communication, and MPI for inter-node communication... I think it should
work very well. I am always interested in being able to create and play
around with my own algorithms. I appreciate that your DMA engine microcode
is available; have you applied for any patents?

matt....@sicortex.com

unread,
Apr 9, 2008, 12:17:49 PM4/9/08
to

We looked at the 20Kc and liked it. However, it was a hard macro
(that is, MIPS supplies completed masks, not synthesizable Verilog)
and designed for 130nm. Our technology target was 90nm. The team
had lots of experience in doing design shrinks and felt the cost
of shrinking the 20Kc was prohibitive. Further, the 20Kc as it stood
was not designed for a cache-coherent SMP. Fitting the necessary
changes into a hard macro made it even more problematic.

We chose the 5Kf, a 64-bit soft IP block from MIPS. Then we
worked hard at the synthesis flow to make it run at 500 MHz
and stay within a sub-watt power budget.

matt

matt....@sicortex.com

unread,
Apr 9, 2008, 12:22:30 PM4/9/08
to
On Apr 9, 12:20 am, "Chris Thomasson" <cris...@comcast.net> wrote:
>
>
> What do you think about my initial idea on how to program your systems? That
> is, using advanced shared-memory multi-threading techniques for intra-node
> communication, and MPI for inter-node communication... I think it should
> work very well. I am always interested in being able to create and play
> around with my own algorithms. I appreciate that your DMA engine microcode
> is available; have you applied for any patents?

There are folks who use shared memory for the intra-node comms and
MPI between nodes. With care, that can work on SiCortex systems and
deliver good performance. Shared memory programming mixed with
message passing works well for some, and brings the problems of both
worlds to others. SiCortex is happy to see both.

Personally, I tend to do all the communications with MPI. That way I
don't need to worry about processor assignments, task mapping, or even
what platform I'm running on. Most of the code that I see from customers
and prospects follows a similar model. We have, however, seen a few
hybrid codes.

already...@yahoo.com

unread,
Apr 9, 2008, 12:51:43 PM4/9/08
to

Thanks for the information, Matt.
I vaguely remember from the Byte articles of the mid-90s that the 5Kf
FPU was optimized toward single-precision performance. Did you change
this part of the core?

Del Cecchi

unread,
Apr 9, 2008, 7:49:13 PM4/9/08
to

<matt....@sicortex.com> wrote in message
news:3e072521-34e1-4a8c...@8g2000hse.googlegroups.com...

This sounds a lot like a Blue Gene, only of course with MIPS taking the
place of PowerPC as the processor. Would you comment on the differences?

del


matt....@sicortex.com

unread,
Apr 10, 2008, 1:13:39 PM4/10/08
to
On Apr 9, 12:51 pm, already5cho...@yahoo.com wrote:
>
> Thanks for information, Matt.
> I vaguely remember from the Byte articles from the mid 90s that 5Kf
> FPU was optimized toward single-precision performance. Did you change
> this part of the core?

We did goose the FP unit a bit. We rebuilt the FP pipeline to support
double precision at 2 FLOPs per cycle (MADD.D, a double-precision
multiply-add), so the double-precision FP rate is the same as the
single-precision FP rate.

Other than that, we added cache coherence to the L1, and full
single-bit correction/double-bit detection to the L1 Dcache. (The I
cache is parity protected.)

There were Byte articles in the mid 90's on the 5Kf? Who knew?

matt....@sicortex.com

unread,
Apr 10, 2008, 1:37:45 PM4/10/08
to
On Apr 9, 7:49 pm, "Del Cecchi" <delcecchioftheno...@gmail.com> wrote:
> This sounds a lot like a Blue Gene, only of course with Mips taking the
> place of PowerPC as the processor. Would you comment on the differences?
>
> del

It doesn't sound anything like a Blue Gene -- it is much, much
quieter. ;)

The major differences relative to BG/L are:

1. Higher-performance inter-node communication (higher BW, lower
end-to-end latency; average under 2 µs MPI ping-pong).

2. Design centered on 972 nodes and smaller. Our ambitions are
to fill needs in day-to-day production environments, not to beat
the Earth Simulator or occupy slots in the Top500. (Somebody
needs to do that, but we chose to focus elsewhere.)

3. Full Linux/POSIX environment on all processors. All system software
and SiCortex libraries are open source.

4. Up to 8GB of DRAM per 6-processor node.

5. SiCortex has configurations from 72 processors (the deskside
development system) to 5832 (the cabinet with the gull-wing doors).
In addition to the SC648 (648 processors) and the SC1458 (1458
processors; how DID we come up with this naming scheme?) there are
incremental versions in between that involve replacing processor
modules with "placeholder modules."

6. Kautz graph topology for all traffic vs. mesh/torus and trees. The
graph diameter is 6 for the largest SiCortex system. (See the check
at the end of this post.)

7. "Generic" IO, as long as you think of PCI Express Modules as
"Generic." Specifically, all systems support GigE, InfiniBand, and
Fibre Channel. We support others, but I don't have the supported IO
list in front of me right now.

8. BG/L does a better job of managing the processor-memory path:
BG/L stream triads are 6x better than SiCortex. Sigh.

BG/P will probably improve on a few of these, but I haven't seen
results from the BG/P installations yet.

There are probably other differences, but I'm more versed on the
SiCortex side of things than the BG/L.
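
A quick back-of-envelope check of point 6, on the assumption that the fabric
is a degree-3 Kautz graph (the degree isn't stated in this thread): a Kautz
graph of out-degree M and dimension N+1 has (M+1)*M^N vertices and diameter
N+1, so

    (M+1)*M^N = (3+1)*3^5 = 972 nodes,    diameter = N+1 = 6,

which matches the 972 six-processor nodes of the SC5832 and the diameter
quoted above.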

already...@yahoo.com

unread,
Apr 10, 2008, 2:10:55 PM4/10/08
to
On Apr 10, 7:13 pm, matt.rei...@sicortex.com wrote:
> On Apr 9, 12:51 pm, already5cho...@yahoo.com wrote:
>
>
>
> > Thanks for information, Matt.
> > I vaguely remember from the Byte articles from the mid 90s that 5Kf
> > FPU was optimized toward single-precision performance. Did you change
> > this part of the core?
>
> We did goose the FP unit a bit. We rebuilt the FP pipeline to support
> double precision at 2FLOPs per cycle (MADD.D a double precision
> mull-add), so the double precision FP rate is the same as the single
> precision FP rate.
>

Very well


> Other than that, we added cache coherence to the L1, and full
> single bit correction/double bit detect to the L1 Dcache. (The I
> cache is parity protected.)
>

For massively-parallel scientific workloads, ECC (if it is actually
ECC) on the L1D cache sounds to me like over-engineering. But what do I know?

> There were Byte articles in the mid 90's on the 5Kf? Who knew?

Power of Internet: http://futuretech.blinkenlights.nl/byte.html

Nick Maclaren

unread,
Apr 10, 2008, 2:56:06 PM4/10/08
to

In article <fe2aecdf-9c0c-4289...@v26g2000prm.googlegroups.com>,

already...@yahoo.com writes:
|> On Apr 10, 7:13 pm, matt.rei...@sicortex.com wrote:
|>
|> > Other than that, we added cache coherence to the L1, and full
|> > single bit correction/double bit detect to the L1 Dcache. (The I
|> > cache is parity protected.)
|>
|> For massively-parallel scientific workloads ECC (if it is actually
|> ECC) on L1D cache sounds to me like over-engineering. But what I know?

Why? Mere single-bit detection on very large amounts of cache isn't
nice, because it forces a policy of almost panicky replacement. If
errors were completely independent, that wouldn't be necessary, but
they aren't.

Also, ECC gives you the option of delaying the write-through if the
memory / higher-level cache channel is busy, where parity doesn't.

I agree that it isn't critical, but it means that some other problems
can be avoided, and ECC technology is very well understood :-)


Regards,
Nick Maclaren.

matt....@sicortex.com

unread,
Apr 10, 2008, 5:07:49 PM4/10/08
to
re: ECC on L1D

Nick is right -- ECC is well understood.

We go back and forth on this all the time. Here's the analysis that always
leads me to ECC:

1. Assume a single-bit-upset rate of 4000 failures per megabit-billion-hours.
(That's a reasonably conservative rule of thumb for a static RAM array at
7000 ft elevation. There are few reliable numbers with good pedigree, and
this isn't one of them. But you have to start with some model. Actual
reported numbers are all over the map.)

2. Assume the double-bit upset rate is far, far lower. (If you don't,
then you need to look at some other considerations.)

3. For SiCortex: 5832 processors * 32KB * 8 bit/B * 4000 / (1 Mbit * 10^9 hrs)
gives 6 failures somewhere in the system every 1000 hours.
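
The arithmetic is easy to reproduce; a trivial C check (the 4000 FIT/Mbit
figure is just the rule of thumb quoted above, and whether a megabit is 10^6
or 2^20 bits only changes the answer slightly):

#include <stdio.h>

int main(void)
{
    double cpus  = 5832.0;
    double bits  = 32.0 * 1024.0 * 8.0;   /* 32 KB of L1 Dcache per processor */
    double mbits = cpus * bits / 1.0e6;   /* roughly 1529 Mbit of SRAM total  */
    double fit   = 4000.0;                /* upsets per Mbit per 10^9 hours   */
    double per_hour = mbits * fit / 1.0e9;
    printf("upsets per 1000 hours: %.1f\n", per_hour * 1000.0);  /* about 6 */
    return 0;
}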

And that is the difference between designing with thousands of processors
in mind and designing for a single desktop or a small cluster. (But note
that even those systems put ECC on the L1 caches now.)

An alternative is to build a write-through L1, and many good designs take
this route. At SiCortex we decided to put ECC in the L1 and keep the L1 to
L2 write path clean. (Write-through caches either require write
aggregating -- a wonderland of interesting memory ordering issues that must
be addressed -- or partial word writes into the L2 -- generally extra hair
that we chose to avoid.) BG/L chose to do a write-through L1, as I recall.
To each his own...


The SiCortex system design paid careful attention to reliability and error
recovery. Wires aren't perfect. Bad things sometimes happen to good bits.
Transients dominate.

And so, the SC5832, with over 26K diff pairs in its fabric, does automatic
error detection and retry at the link level. (And we guarantee in-order
message delivery, even in the presence of transient errors. Permanent
faults are another (and thankfully, much much rarer) matter.)

We "over engineer" the thermal aspects so each node chip runs relatively
cool.

Every RAM array that contains "unique" data (all RAM arrays other than the
ICache) is protected by ECC.

And when things go wrong, the system can configure around sick nodes until
a service intervention can be scheduled.

Apologies if this sounded like a commercial. That wasn't my intent. The
intent is to get across the idea that large-N multiprocessors like BG/L,
SiCortex, and Cray XT3/XT4 systems only work well when the designers get
their heads around the idea that every error source gets multiplied by
three orders of magnitude or more.

When you are designing a quad-socket server pizza box, you could reasonably
choose to ignore hardware error cases that happen every 16K hours: chances
are the user will blame the software anyway. There are responsible,
professional, admirable designers who make this choice every day. Not
everybody is willing to pay for hardware reliability.

But when the component is designed to be part of an ensemble of thousands
of components, the reliability calculation changes. That two-year MTTF
turns into a one-day MTTF, and no amount of bad software can absorb all the
blame. ;)


Del Cecchi

unread,
Apr 10, 2008, 7:05:07 PM4/10/08
to

<matt....@sicortex.com> wrote in message
news:13192c58-df20-464e...@8g2000hsu.googlegroups.com...

Thanks. Now I have to go bone up on "Kautz graph topology".

And yes, I have heard the BG/P network is better. Not surprising, since time
has passed....

del


Paul A. Clayton

unread,
Apr 11, 2008, 5:19:25 PM4/11/08
to
On Apr 10, 1:37 pm, matt.rei...@sicortex.com wrote:
> On Apr 9, 7:49 pm, "Del Cecchi" <delcecchioftheno...@gmail.com> wrote:
>
> > This sounds a lot like a Blue Gene, only of course with Mips taking the
> > place of PowerPC as the processor. Would you comment on the differences?
>
> > del
>
> I doesn't sound anything like a Blue Gene -- it is much much
> quieter. ;)
>
> The major differences relative to BG/L are

I would also consider the 2-wide FPU of the Blue Gene processor a
significant difference (4 FLOPs per cycle vs. 2 FLOPs per cycle
for the SiCortex processor). I was a bit disappointed that the
SiCortex did not exploit such SIMD. Perhaps the targeted
workloads are not as computationally dense (i.e., the system
would be unbalanced relative to memory bandwidth or other
resources)? (Two-element 'vectorizability' is common, isn't it?)

matt....@sicortex.com

unread,
Apr 12, 2008, 8:10:00 PM4/12/08
to
On Apr 11, 5:19 pm, "Paul A. Clayton" <paaronclay...@earthlink.net>
wrote:

>
> I would also consider the 2-wide FPU of Blue Gene processor a
> significant difference (4 FLOPs per cycle vs. 2 FLOPs per cycle
> for the SiCortex processor). I was a bit disappointed that the SiCortex did not exploit such SIMD. Perhaps the targeted

>
> Paul A. Clayton
> just a technophile
> reachable as 'paaronclayton'
> at "embarqmail.com"

Early on, BG/L users had trouble coaxing the compilers to use the second
FP pipe, as I recall. Del? The issue rules and exclusions around the second
pipe took some adapting to.

Adding more FP without commensurate memory bandwidth
and communications bandwidth is often a waste of power.

But getting 4 FLOPs/cycle of single precision would be handy. Sigh.


matt

Del Cecchi

unread,
Apr 12, 2008, 11:43:28 PM4/12/08
to

<matt....@sicortex.com> wrote in message
news:adc0a048-6ad5-4030...@c65g2000hsa.googlegroups.com...

Don't look at me. I was a circuit designer. Well, I still am but not
actively at the moment. So all that software stuff is out of my field.
Holes and electrons and femtofarads and picoseconds, now you're talking
my language.... :-)


Nick Maclaren

unread,
Apr 13, 2008, 3:40:55 AM4/13/08
to

In article <adc0a048-6ad5-4030...@c65g2000hsa.googlegroups.com>,
matt....@sicortex.com writes:
|> On Apr 11, 5:19 pm, "Paul A. Clayton" <paaronclay...@earthlink.net>

|> wrote:
|> >
|> Adding more FP without commensurate memory bandwidth
|> and communications bandwidth is often a waste of power.

Yes.

|> But getting 4FLOPS/cycle of single precision would be handy. sigh.

Not really, except for image and audio work. While there are a few
applications for which single precision is enough on a modern high-
performance computer (please note), there aren't many. There are a
fair number where it would be enough, if all parts were programmed
by a numerical expert - and there are damn few of those still active.

You can't even solve a large, well-conditioned set of linear equations
in single precision, if the matrix is banded - there is a limit of
about a million on the number of equations at which point ALL accuracy
is lost. See Wilkinson and Reinsch. Solving a mere 10,000 (which can
be done easily even for unbanded matrices) won't give you more than
about 1% accuracy. And those are the BEST cases - any ill-conditioning
and forget it!
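
The rough form of the bound behind those numbers (constants and the exact
power of the dimension vary with the algorithm and the norm used) is

    ||dx|| / ||x||  <~  c * n * kappa(A) * eps,    eps_single ~ 6e-8,

so even with kappa(A) = 1, n around 10^6 leaves essentially no correct
digits in single precision, and n around 10^4 is consistent with the
roughly 1% figure above.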


Regards,
Nick Maclaren.

Bernd Paysan

unread,
Apr 13, 2008, 9:23:48 AM4/13/08
to
Nick Maclaren wrote:
> You can't even solve a large, well-conditioned set of linear equations
> in single precision, if the matrix is banded - there is a limit of
> about a million on the number of equations at which point ALL accuracy
> is lost. See Wilkinson and Reinsch. Solving a mere 10,000 (which can
> be done easily even for unbanded matrices) won't give you more than
> about 1% accuracy. And those are the BEST cases - any ill-conditioning
> and forget it!

Question: How do you get the coefficients for such a matrix to be more
accurate than SP? I mean for a real-world problem. Most measurements give
you something between 8 and 24 integer bits; on some rare occasions, you
get a few more. E.g. if you do your 14-day weather forecast with "the
required precision", you forget that your actual data is not that precise,
either. You may not lose any precision during the calculation process, but
if you change one of the inputs by just one bit, you end up with completely
different weather.

It's a shame that random rounding is not supported by common
number-crunching hardware. When you want to know whether your algorithm
actually works with imprecise input data, you can just try slight changes
to the input. But then, you still need to have enough headroom in the
actual calculation, and in all non-trivial iterative calculations, you
can't really know. So I'd rather like to have random rounding, and when
the result is quite different each time, I just know that it's not stable.
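
Random rounding is easy enough to emulate in software for exactly this kind
of experiment. A minimal sketch -- round a double intermediate to single
precision, picking the lower or upper neighbouring float with probability
proportional to the distance (rand() is only a stand-in for a better
generator, and overflow/NaN handling is ignored):

#include <math.h>
#include <stdlib.h>

float round_stochastic(double x)
{
    float down = (float)x;                        /* round-to-nearest float */
    if ((double)down > x)                         /* ...which may lie above x */
        down = nextafterf(down, -INFINITY);
    if ((double)down == x)
        return down;                              /* exactly representable */
    float up = nextafterf(down, INFINITY);
    double p = (x - down) / ((double)up - down);  /* distance-weighted choice */
    return (rand() < p * RAND_MAX) ? up : down;
}

Run the same computation a few times with this in place of plain float
conversions; if the answers scatter, the algorithm is not stable at that
precision, which is precisely the test described above.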

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Del Cecchi

unread,
Apr 14, 2008, 10:55:33 AM4/14/08
to

"Bernd Paysan" <bernd....@gmx.de> wrote in message
news:44e9d5-...@vimes.paysan.nom...
>
(snip)

> It's a shame that random rounding is not supported by common number
> crunching hardware. When you want to know if your algorithm actually
> works
> with imprecise input data, you can just try with slight changes on the
> input. But then, you still need to have enough headroom in the actual
> calculation, and in all non-trivial iterative calculations, you can't
> really know. So I'd rather like to have random rounding, and when the
> result is quite different each time, I just know that it's not stable.
>
> --
> Bernd Paysan
> "If you want it done right, you have to do it yourself"
> http://www.jwdt.com/~paysan/

True random numbers are a real pain to generate on a chip. Would
pseudo-random work?


Bernd Paysan

unread,
Apr 14, 2008, 11:25:28 AM4/14/08
to
Del Cecchi wrote:
> True random numbers are a real pain to generate on a chip. Would
> pseudo-random work?

Usually, yes. Last time I embedded such a random number generator into a
chip (for Audio DSP stuff), I used an xorshift RNG. The main requirement is
that all bits are shuffled around for the next cycle, and that the overall
bit pattern is unpredictable from the program's point of view. For all
current CPUs, generating a new pseudo-random number every cycle (regardless
of what the CPU is doing) is sufficient, and if you have several units using
such a random number for rounding, having a different random number sequence
(a different start value should be sufficient) for each unit is a good idea,
too.
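
For reference, the whole of such a generator is only a few lines; this is
Marsaglia's 64-bit xorshift (a generic sketch, not the one actually shipped
in that chip). In hardware each step is three shift/XOR stages, so producing
a fresh value every cycle is cheap, and seeding each rounding unit with a
different non-zero value gives the independent sequences mentioned above:

#include <stdint.h>

typedef struct { uint64_t s; } xorshift64;        /* seed must be non-zero */

static uint64_t xorshift64_next(xorshift64 *g)
{
    uint64_t x = g->s;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return g->s = x;
}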

Greg Lindahl

unread,
Apr 14, 2008, 1:13:02 PM4/14/08
to
In article <44e9d5-...@vimes.paysan.nom>,
Bernd Paysan <bernd....@gmx.de> wrote:

>E.g. if you do your 14 days weather forecast with "the
>required precision", you forget that your actual data is not that precise,
>either. You may not lose any precision during the calculation process, but
>if you change one of the inputs by just one bit, you end up with completely
>different weather.

... which is why many weather forecasts are done as an ensemble
computation.

>It's a shame that random rounding is not supported by common number
>crunching hardware.

You can always randomly perturb the data during the computation; no
need for hardware.
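
A sketch of that software route -- nudge every input by one ulp in a random
direction, rerun, and compare (nextafterf and rand() are ordinary C here,
nothing platform-specific):

#include <math.h>
#include <stdlib.h>

void perturb_one_ulp(float *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = nextafterf(a[i], (rand() & 1) ? INFINITY : -INFINITY);
}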

-- greg

Nick Maclaren

unread,
Apr 24, 2008, 5:20:45 AM4/24/08
to

In article <4803909e$1...@news.meer.net>,

That's not equivalent, and doesn't have the same advantages, but you
can obviously do random rounding in software just as well as in
hardware (if much more slowly).


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Apr 24, 2008, 5:23:47 AM4/24/08
to

In article <8k9cd5-...@annette.mikron.de>,

Bernd Paysan <bernd....@gmx.de> writes:
|> Del Cecchi wrote:
|> > True random numbers are a real pain to generate on a chip. Would
|> > pseudo-random work?
|>
|> Usually, yes. Last time I embedded such a random number generator into a
|> chip (for Audio DSP stuff), I used an xorshift RNG. The main requirement is
|> that all bits are shuffled around for the next cycle, and that the overall
|> bit pattern is unpredictable from the program's point of view. ...

Yes, precisely. It is an open theoretical question whether it is
always possible, but all practical evidence is that it is. You need
to do a bit better than a simple shift-register RNG, but there is no
problem in implementing extremely fast, excellent-quality RNGs in
hardware.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Apr 24, 2008, 5:26:50 AM4/24/08
to

In article <44e9d5-...@vimes.paysan.nom>,

Bernd Paysan <bernd....@gmx.de> writes:
|>
|> > You can't even solve a large, well-conditioned set of linear equations
|> > in single precision, if the matrix is banded - there is a limit of
|> > about a million on the number of equations at which point ALL accuracy
|> > is lost. See Wilkinson and Reinsch. Solving a mere 10,000 (which can
|> > be done easily even for unbanded matrices) won't give you more than
|> > about 1% accuracy. And those are the BEST cases - any ill-conditioning
|> > and forget it!
|>
|> Question: How do you get the coefficients for such a matrix to be more
|> accurate than SP? I mean for a real-world problem. ...

I could give examples (not many), but that's not my point. My point is
that the actual OPERATION (i.e. solution of equations) introduces an
error on the order of the dimension times the precision times the
condition number.
Even if the last is 1.0 (fairly rare in practice), single precision
solutions are grossly inaccurate. As I said, see Wilkinson and Reinsch.


Regards,
Nick Maclaren.

Stephen Fuld

unread,
Apr 24, 2008, 11:53:29 AM4/24/08
to

Since you guys are talking about *pseudo* RNGs, not "true" RNGs, can I
correctly surmise that pseudo is sufficient for this application?


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Nick Maclaren

unread,
Apr 24, 2008, 12:09:05 PM4/24/08
to

In article <Z12Qj.124846$D_3.1...@bgtnsc05-news.ops.worldnet.att.net>,

Almost certainly :-)

I am a bit rusty, but can go on about this topic for hours, at any
level between practical programming and mathematical theory. The
executive summary is three things:

1) Pseudo-random numbers are good enough for any practical purpose,
provided that their restrictions are understood, and the generators
are of good enough quality for their uses.

2) Arranging that it is so is a little-understood topic - note that
Knuth gives only an introduction to the topic, and I could tell you a
few critical things he doesn't cover.

3) With modern techniques and constraints, it is easy to build
an extremely high-quality generator in hardware that runs at a very
high speed and low cost.


Regards,
Nick Maclaren.

Del Cecchi

unread,
Apr 25, 2008, 10:58:31 AM4/25/08
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:fuqbb1$lm9$1...@gemini.csx.cam.ac.uk...

Except of course for those things that require actual randomness. What
is your opinion about cryptographic applications?


>
> 2) Arranging that is so is a little-understood topic - note that
> Knuth gives only an introduction to the topic, and I could tell you a
> few critical things he doesn't cover.
>
> 3) With modern techniques and constraints, it is easy to build
> an extremely high-quality generator in hardware that runs at a very
> high speed and low cost.

Again, I presume you are talking about PseudoRandom and not true random
number generators.
>
>
> Regards,
> Nick Maclaren.


Nick Maclaren

unread,
Apr 25, 2008, 12:04:01 PM4/25/08
to

In article <67e9siF...@mid.individual.net>,

"Del Cecchi" <delcecchi...@gmail.com> writes:
|>
|> > 1) Pseudo-random numbers are good enough for any practical purpose,
|> > provided that their restrictions are understood, and the generators
|> > are of good enough quality for their uses.
|>
|> Except of course for those things that require actual randomness. What
|> is your opinion about cryptographic applications?

But this was in the context of rounding arithmetic operations! I agree
that pseudo-random numbers are NBG for cryptographic applications, but
must make a slight correction to your implication.

Cryptographic applications need unpredictability, not statistical
randomness, and it is possible for a generator to be excellent for
cryptographic work and dire for statistical. A lot of nonsense is
based on the mathematically true but practically irrelevant statement
that a perfect generator is ideal for both purposes - yes, it is, but
what engineer can deliver perfection?

|> > 3) With modern techniques and constraints, it is easy to build
|> > an extremely high-quality generator in hardware that runs at a very
|> > high speed and low cost.
|>
|> Again, I presume you are talking about PseudoRandom and not true random
|> number generators.

Yes. I believe that unpredictable generators can deliver only two of
high quality, high speed and low cost.


Regards,
Nick Maclaren.
