
New IBM Supercomputer to use register windows.


Brett Davis

Nov 1, 2010, 11:55:03 PM
New IBM Supercomputer to use:
GCC
Register windows
Fixed width instructions
40 bit instructions
256 registers
Unified int/float registers
Four partitioned register banks
Cell like local memory only, no global RAM.

http://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=Kenny.pdf
from:
http://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=view&target=Kenny.pdf
from:
http://www.realworldtech.com/forums/index.cfm?action=detail&id=113934&threadid=113934&roomid=2

My thoughts on importance, from good for performance to irrelevant/stupid:

This is a feather in my cap for predicting that Cell-style architectures
are the future.

Partitioned register files are also a given for the nosebleed end
of the performance market.
A 68000 style address versus data register partitioning is good enough
for 99% of the market. (This deserves its own thread.)
Four wide does give you vector performance with no vector registers...
I expect each pipeline to be dual issue, so eight ops/cycle.
Also expect something like vector loads/stores to all 4 pipes at once.
This may have dictated the register windows, all four pipes need to
use the same register number.

Unified int/float registers, a given, as a separate partition just for
floats is a waste of die space.

256 visible registers, big iron overkill.

40 bit fixed width instructions, while Cray and I would normally
attack this choice as stupid, instruction RAM size for this
supercomputer is just not the cost issue it is in all other markets.
Fixing the instruction set can wait until after IBM figures out how
to make this design work.

Register windows...
This is due to a host of shortcomings in the compiler and due to
a need to reduce the instruction footprint. Register saves/restores
do not need visible instructions. I should have listed this as
an advantage of register windows in the Disadvantages of Register
Windows thread two months ago.
Done correctly in a separate pipeline this has the potential to
make register windows a tiny win. This does not scale down-market;
I still view register windows as an epic fail.

GCC: have these poor fools not heard of LLVM and Clang?
Yes, it's a matter of existing GCC changes, and no money/time to
convert to a more modern compiler.

Brett

PS: This design shows some of the "richness and variety of the computing
world" I have been looking for, as opposed to all those boring RISC chips.

nedbrek

Nov 2, 2010, 10:00:28 AM

"Brett Davis" <gg...@yahoo.com> wrote in message
news:ggtgp-FEC781....@news.isp.giganews.com...

> New IBM Supercomputer to use:
> GCC
> Register windows
> Fixed width instructions
> 40 bit instructions
> 256 registers
> Unified int/float registers
> Four partitioned register banks
> Cell like local memory only, no global RAM.
>
> http://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=view&target=Kenny.pdf

Page 5:
"Variable number of INs, OUTs, and LOCALs.
Much more interesting than SPARC or IA-64."

Yea, that's what I want, more interesting than IA-64 :)

> GCC, have these poor fools not heard of LLVM and Clang.
> Yes, its a matter of existing GCC changes, and no money/time to
> convert to a more modern compiler.

Page 7:
"Making GCC handle a variable number of INs and
OUTs this was difficult." :)

Page 8:
"Uses a 40 bit fixed width instruction."
How does this work on the hardware side? Are I/D lines separated in the L2?
Harvard arch? The discussion on page 9 sounds like they just eat it (is
there no VM?)

Weird.

Ned


nm...@cam.ac.uk

Nov 2, 2010, 10:26:17 AM
In article <ggtgp-FEC781....@news.isp.giganews.com>,

Brett Davis <gg...@yahoo.com> wrote:
>New IBM Supercomputer to use:
>
>This is feather in my cap for predicting Cell style architectures
>are the future.

Whatever tickles your fancy :-)

I suspect a certain amount of Hitachi technology (or at least
influence) in there - and, yes, they have a technology sharing
agreement that covers this.


Regards,
Nick Maclaren.

Niels Jørgen Kruse

Nov 2, 2010, 10:24:11 AM
nedbrek <ned...@yahoo.com> wrote:

They write that the instruction pointer counts instructions, not bytes,
so presumably on a jump they skip over leading bits to get to the
instruction.

On page 10 there is a strange sentence: "The machine has a uniform
address space (NUMA)."

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Niels Jørgen Kruse

Nov 2, 2010, 10:29:50 AM
nedbrek <ned...@yahoo.com> wrote:

They write that the instruction pointer counts instructions, not bytes,
so they multiply by 5 to get the byte address.

Robert Myers

Nov 2, 2010, 12:59:51 PM
""Niels Jørgen Kruse"" <nos...@ab-katrinedal.dk> wrote in message
news:1jrc8fz.131qeud11svgvjN%nos...@ab-katrinedal.dk...

.
>
> On page 10 there is a strange sentence: "The machine has a uniform
> address space (NUMA)."
>

I don't see what's puzzling.

The address space is uniform, the access is non-uniform. That's pretty
much standard, isn't it?

I don't understand Brett's comment about it being Cell-like at all. I
thought Cell computers used an address space local to the Cell. That is to
say, non-uniform access to a non-global address space.

Robert.


Paul A. Clayton

Nov 2, 2010, 1:24:57 PM
On Nov 2, 10:24 am, nos...@ab-katrinedal.dk (Niels Jørgen Kruse)
wrote:
> nedbrek <nedb...@yahoo.com> wrote:
[snip]

> > Page 8:
> > "Uses a 40 bit fixed width instruction."
> > How does this work on the hardware side?  Are I/D lines separated in the L2?
> > Harvard arch?  The discussion on page 9 sounds like they just eat it (is
> > there no VM?)
>
> They write that the instruction pointer counts instructions, not bytes,
> so presumably on a jump they skip over leading bits to get to the
> instruction.

The problem is not addressing the instructions (as noted,
multiplication by five is not that expensive) but that in any
area/level of memory used for both data and instructions the
fetch granularity will have to be optimized for one or
the other. Stealing ECC bits would allow a generic
fetch block to have a whole number of instructions (e.g.,
13*5B = 65B + 7B metadata/ECC in a 72B block), but 13-instruction
Icache blocks would be a bit awkward even for an L2 Icache!
Having (L2) Icache blocks five times the size of the
generic fetch unit would seem inefficient. Subblocking
could be used, but one would then want a mechanism to
use the cache space that would otherwise be associated
with an invalid subblock. (48-bit instructions might
be a little easier in this regard--only three fetch
blocks required.)

Alternatively, encoding 24 instructions in 128 bytes
(with the extra eight bytes used for metadata or
special immediates) might not be too wasteful of
capacity. This would still require some subblock
tricks.
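
For what it's worth, the packing arithmetic above checks out with a few
lines of C (the 72B and 128B layouts are the ones described above; the
program is only an illustration):

#include <stdio.h>

/* Each entry is a raw fetch block plus the bytes held back for
 * ECC/metadata: 72B is the usual 64B+8B ECC burst with some ECC bits
 * "stolen", 128B is the 24-instruction layout mentioned above. */
int main(void)
{
    struct { int block, reserved; } layouts[] = { { 72, 7 }, { 128, 8 } };
    for (int i = 0; i < 2; i++) {
        int usable = layouts[i].block - layouts[i].reserved;
        printf("%3dB block, %dB metadata: %2d x 5B instructions, %dB slack\n",
               layouts[i].block, layouts[i].reserved, usable / 5, usable % 5);
    }
    return 0;
}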

> On page 10 there is a strange sentence: "The machine has a uniform
> address space (NUMA)."

I believe this is referring to the use of simple pointers
even though there are memory regions with different
bandwidth and latency characteristics (e.g., there
would not be different pointer types for L1 scratchpad
versus off-chip DRAM).

Paul A. Clayton
just a technophile

> --
> Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Paul A. Clayton

Nov 2, 2010, 1:39:09 PM
On Nov 2, 12:59 pm, "Robert Myers" <rbmyers...@gmail.com> wrote:
> ""Niels Jørgen Kruse"" <nos...@ab-katrinedal.dk> wrote in messagenews:1jrc8fz.131qeud11svgvjN%nos...@ab-katrinedal.dk...

> > On page 10 there is a strange sentence: "The machine has a uniform
> > address space (NUMA)."
>
>  I don't see what's puzzling.
>
> The address space is uniform, the access is non-uniform.   That's pretty
> much standard, isn't it?
>
> I don't understand Brett's comment about it being Cell-like at all.    I
> thought Cell computers used an address space local to the Cell.  That is to
> say, non-uniform access to a non-global address space.

page 10:
"*Each processor has several classes of memory that
are local.
– The classes differ in size and latency."

page 11:
"*The processing elements consist of an execution unit and some
local memory.
* No global memory.
* The entire system looks like a large number of these processing
elements connected by a big packet switch."

Rick Jones

Nov 2, 2010, 1:16:30 PM
> > On page 10 there is a strange sentence: "The machine has a uniform
> > address space (NUMA)."

> I believe this is referring to the use of simple pointers even
> though there are memory regions with different bandwidth and
> latencies characteristics (e.g., there would not be different
> pointer types for L1 scratchpad versus off-chip DRAM).

Shouldn't that then say "unified address space?"

rick jones
--
the road to hell is paved with business decisions...
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

Quadibloc

Nov 2, 2010, 10:42:11 PM
On Nov 2, 8:00 am, "nedbrek" <nedb...@yahoo.com> wrote:
> "Brett Davis" <gg...@yahoo.com> wrote in message
> news:ggtgp-FEC781....@news.isp.giganews.com...

> Page 5:


> "Variable number of INs, OUTs, and LOCALs.
> Much more interesting than SPARC or IA-64."
>
> Yea, that's what I want, more interesting than IA-64 :)

Well, "more interesting" doesn't have to mean "more complicated".

Fortunately,

http://www.quadibloc.com/arch/arcint.htm

is unlikely to be implemented.

> Page 8:
> "Uses a 40 bit fixed width instruction."
> How does this work on the hardware side?  Are I/D lines separated in the L2?
> Harvard arch?  The discussion on page 9 sounds like they just eat it (is
> there no VM?)

Since they have to multiply the instruction location by five to get
the address, yes, they must "just eat it" - the memory data bus would
have to be a power of two.

What I would advocate is to fetch five blocks of memory at once -
perhaps the path to memory is 256 bits wide - and have an instruction
cache that's organized in such a way that it is divided into 40-bit
words easily. Load it with five 256-bit blocks, read out thirty-two
40-bit instructions.

I show a diagram of a more general mechanism to allow a computer to
use its L2 cache to change its word length to multiple desired values
at

http://www.quadibloc.com/arch/ar05.htm

John Savard

Andy "Krazy" Glew

Nov 3, 2010, 1:40:59 AM
On 11/2/2010 10:24 AM, Paul A. Clayton wrote:
> On Nov 2, 10:24 am, nos...@ab-katrinedal.dk (Niels Jørgen Kruse)
> wrote:
>> nedbrek<nedb...@yahoo.com> wrote:
> [snip]
>>> Page 8:
>>> "Uses a 40 bit fixed width instruction."
>>> How does this work on the hardware side? Are I/D lines separated in the L2?
>>> Harvard arch? The discussion on page 9 sounds like they just eat it (is
>>> there no VM?)
>>
>> They write that the instruction pointer counts instructions, not bytes,
>> so presumably on a jump they skip over leading bits to get to the
>> instruction.
>
> The problem is not addressing the instructions (as noted
> multiplication by five is not that expensive) but in any
> area/level of memory used for data or instructions the
> fetch granularity will have to be optimized for one or
> the other.

No need to get fancy.

Just fetch ordinary cache lines.

If they want, the I$ can be optimized to handle 40 bit instructions;
specifically, they might use the self-aligned trick to fetch
instructions that wrap around a cache line. But even this seems
unnecessary. x86 does well with far more awkward instruction sizes just
by fetching 16 bytes at a time.

---

Because the instruction pointer counts instructions, they just take
(IP<<2)+IP to calculate a byte address. I am sure that they don't need to
calculate this all the time: certainly not if fetching consecutive
instructions from the I$. Possibly only when doing a branch; possibly
not for direct branches, possibly only for indirect branches; and
possibly not even then, in many cases.
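
For concreteness, a minimal C sketch of that shift-add conversion (the
64-bit widths and the optional code-base offset are illustrative
assumptions, not details from the slides):

#include <stdint.h>

/* Instruction number -> byte address.  IP*5 is done as (IP << 2) + IP,
 * i.e. one shift and one add; the code_base term is optional. */
static inline uint64_t insn_to_byte(uint64_t insn_nr, uint64_t code_base)
{
    return code_base + ((insn_nr << 2) + insn_nr);
}

/* Sequential fetch never needs the multiply: just step a byte pointer
 * by 5 alongside the instruction counter. */
static inline uint64_t next_byte(uint64_t byte_addr)
{
    return byte_addr + 5;
}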

---

IMHO this is one of the most reasonable decisions they seem to have
made. It is silly to let yourself be hamstrung by power of two sized
instructions.

I've worked a lot on both 24 bit and 40 bit instruction sets --- the
iA32plus reference on my CV is to such an instruction set.

Terje Mathisen

Nov 3, 2010, 2:15:32 AM
Andy "Krazy" Glew wrote:
> On 11/2/2010 10:24 AM, Paul A. Clayton wrote:
>> The problem is not addressing the instructions (as noted
>> multiplication by five is not that expensive) but in any
>> area/level of memory used for data or instructions the
>> fetch granularity will have to be optimized for one or
>> the other.
>
> No need to get fancy.
>
> Just fetch ordinary cache lines.

Thanks!

I'd been waiting for someone to point out that this isn't a real problem:

Since pretty much all of us are working on x86-based machines these
days, we all have existence proofs that unaligned opcodes work fine.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Brett Davis

Nov 3, 2010, 2:28:44 AM
In article <kuXzo.5805$Mk2....@newsfe13.iad>,
"Robert Myers" <rbmye...@gmail.com> wrote:

How could you possibly port UNIX to this chip unless each CPU had an
address space that was global?
It is nice if most memory copies between CPUs act like a memcpy
so you have a new guaranteed local address.
But things like mutexes and semaphores are designed around a fixed
address that is global.

The shocking part is that having a global memory pool was declared
useless and unnecessary, actually far worse, a counterproductive
hindrance to performance. We see shades of this on the PS3.
We saw this as PS3 code was backported to the XBox360 and doubled
the performance of the XBox360 hardware.

Some might say this is just a PC cluster, but that is a completely
wrong mindset.
This is small local memory and massive interconnect. Programs will
be written like Cell, computation will stream between multiple CPUs
with each CPU doing a (large) fraction of the work.

XBox360 proved that generic RISC chips could run Cell software.
Instruction set is just not important.

Lack of global memory pool, "small" local memory, and
massive interconnect point to a Cell mindset.
This is more important than the unique instruction set, plug a
MIPS chip in this design instead and it is still a Cell architecture.

Brett

Message has been deleted

Paul A. Clayton

Nov 3, 2010, 9:22:06 AM
On Nov 3, 1:40 am, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:
[snip]

> No need to get fancy.
>
> Just fetch ordinary cache lines.
>
> If they want, the I$ can be optimized to handle 40 bit instructions;
> specifically, they might use the self-aligned trick to fetch
> instructions that wrap around a cache line.  But even this seems
> unnecessary.  x86 does well with far more awkward instruction sizes just
> by fetching 16-bytes at a time.

I received the impression that the typical fetch-and-decode
was intended to be very simple: "Fixed length instruction
machines have simpler decoder circuitry." (page 8).

After a little thought, 'wasting' some memory bandwidth is
not a horrible crime--not all instructions read from memory
will be used anyway.

> ---
>
> Because the instruction pointer counts instructions, they just take
> IP<<2+IP to calculate a byte address. I am sure that they don't need to
> calculate this all the time: certainly not if fetching consecutive
> instructions from the I$.  Possibly only when doing a branch; possibly
> not for direct branches, possibly only for indirect branches; and
> possible not even then, in many cases.
>
> ---
>
> IMHO this is one of the most reasonable decisions they seem to have
> made.  It is silly to let yourself be hamstrung by power of two sized
> instructions.

But then how much benefit does fixed instruction length
provide?

nm...@cam.ac.uk

Nov 3, 2010, 12:33:39 PM
In article <e6b0b3c2-0f76-419c...@s4g2000yql.googlegroups.com>,
Paul A. Clayton <paaron...@gmail.com> wrote:
>On Nov 3, 1:40 am, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
>wrote:
>>

>> IMHO this is one of the most reasonable decisions they seem to have
>> made. It is silly to let yourself be hamstrung by power of two sized
>> instructions.
>
>But then how much benefit does fixed instruction length
>provide?

As much as it always did. Using a power of two size for instructions
provides no significant advantage over using any other fixed size.


Regards,
Nick Maclaren.

Andy "Krazy" Glew

Nov 3, 2010, 12:37:10 PM
On 11/3/2010 6:22 AM, Paul A. Clayton wrote:
> On Nov 3, 1:40 am, "Andy \"Krazy\" Glew"<a...@SPAM.comp-arch.net>
> wrote:
> [snip]
>> No need to get fancy.
>>
>> Just fetch ordinary cache lines.
>
> I received the impression that the typical fetch-and-decode
> was intended to be very simple: "Fixed length instruction
> machines have simpler decoder circuitry." (page 8).
>
>>
>> IMHO this is one of the most reasonable decisions they seem to have
>> made. It is silly to let yourself be hamstrung by power of two sized
>> instructions.
>
> But then how much benefit does fixed instruction length
> provide?


There are 2, maybe 3, aspects to instruction decode:
a) alignment of a given instruction
b) decode
with the third aspect being for superscalar
c) finding the start of instructions after the current, and aligning those.

These are easy with fixed-size 40 bit instructions (indeed, for any
fixed size):

They will have to have fairly arbitrary byte alignment circuitry.
(Although they could have, e.g., 5 16-byte ifetch buffers, 80 bytes, and
have a restricted instruction alignment network. Or maybe 5x8. Or ...
Hmmm... it might be tempting to make the entire I$ work like that, 5
"alignment banks".)

After alignment, the fixed length decoding is as easy as any other fixed
length instruction set. Probably easier, since you won't need to cram
so much to make it fit into 32 bits.
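
To make the "5 alignment banks" aside concrete, a tiny C sketch (the
80-byte window built from five 16-byte buffers is the aside above; the
indexing below is just one illustrative way to slice it):

#include <stdint.h>
#include <stdbool.h>

/* An 80-byte fetch window holds 16 five-byte instructions across five
 * 16-byte buffers.  For instruction k (0..15) in the window, find which
 * buffer its first byte lives in, at what offset, and whether its five
 * bytes spill into the next buffer. */
typedef struct {
    unsigned buffer;    /* 0..4: buffer holding the first byte */
    unsigned offset;    /* 0..15: byte offset within that buffer */
    bool     straddles; /* true if the instruction crosses into buffer+1 */
} insn_slot;

static insn_slot locate(unsigned k)
{
    unsigned byte = k * 5;              /* start byte within the window */
    insn_slot s;
    s.buffer = byte / 16;
    s.offset = byte % 16;
    s.straddles = (s.offset > 16 - 5);  /* offsets 12..15 spill over */
    return s;
}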

nm...@cam.ac.uk

Nov 3, 2010, 2:20:24 PM
In article <ggtgp-4AB715....@news.isp.giganews.com>,

Brett Davis <gg...@yahoo.com> wrote:
>In article <kuXzo.5805$Mk2....@newsfe13.iad>,
> "Robert Myers" <rbmye...@gmail.com> wrote:
>
>> I don't understand Brett's comment about it being Cell-like at all. I
>> thought Cell computers used an address space local to the Cell. That is to
>> say, non-uniform access to a non-global address space.
>
>How could you possibly port UNIX to this chip unless each CPU had an
>address space that was global?

Having supported a Unix that ran on a distributed memory system,
it's a lot easier than you think. Not easy, but feasible.


Regards,
Nick Maclaren.

Jason Riedy

Nov 3, 2010, 3:53:12 PM
[...]

>> But then how much benefit does fixed instruction length
>> provide?
>
> As much as it always did. Using a power of two size for instructions
> provides no significant advantage over using any other fixed size.

To the *architecture*. Making analysis and debugging tools simpler may
be a significant advantage to fixed-length instructions. A program
becomes a regularly accessed array, there's no state for insn parsing,
etc.

Other issues are easy to overcome, but making debugging tools more
complex scares me.

Jason

nm...@cam.ac.uk

Nov 3, 2010, 4:08:07 PM
In article <87k4kur...@NaN.sparse.dyndns.org>,

You have got two responses confused. I said what you said some
threads back, and (clearly) agree with you.

However, my response was saying that instruction sizes of 32 or
64 convey virtually no benefit over ones of 40, 37 or any other
size. There's some extra grobble in the fetcher, but instruction
streams are (a) streams and (b) read-only. If you allow code
to modify the next few instructions, then things are different,
but nobody sane allows that nowadays.


Regards,
Nick Maclaren.

Andy "Krazy" Glew

Nov 3, 2010, 10:12:39 PM


I think the IBM folk have proposed a fixed length 40-bit ISA that treats
all of *virtual* memory as an array. No overlapping instructions.

(I may be projecting, because that's what the funny fixed length ISAs I
have been involved in did.)

I.e.
virtual address bytes 0-4 are instruction 0
5-9 -> instruction 1
...
5n-5n+4 -> instruction n
...
you are NOT allowed to have misaligned, overlapping, instructions,
e.g. in virtual address bytes 2,3,4,5,6


This leaves virtual address page mapping as the only way to get
overlapping instructions. E.g. the same page might be mapped at both
address 0 and 4K, meaning that bytes 0,1,2,3,4 of the page are an
instruction when mapped at virtual address 0,
and byte 0,1,2,3 form an instruction whose first byte is byte 4095 of
the page mapped at 0, when the page is mapped at virtual address 4K (4096).


To eliminate that source of overlapping instructions, one can imagine
performing the mapping only with a page - say a 4K page.

e.g. instruction number in page =
(Address mod 4K) mod 5
0-4 -> 0
5-9 -> 1
...
4090-4094 -> 817
byte 4091 is reserved, and cannot be used as part of an instruction.

You have to modify the logic to increment instructions sequentially a
bit. Plus, an instruction address tends to want to look like
(VirtualAddressPageNumber,NumberOfInstructionInPage)

Or, rather InstructionNumberInVirtualInstructionNumberSpace=>
VirtualAddressPageNumber*817 + InstructionNumberInPage.


I started writing this all up last night on the comp-arch.net wiki, but
Google Chrome hiccuped and I lost everything. I'll redo it tonight -
right after this, in fact.

Andy "Krazy" Glew

Nov 3, 2010, 10:40:31 PM
On 11/3/2010 1:08 PM, nm...@cam.ac.uk wrote:

> However, my response was saying that instruction sizes of 32 or
> 64 convey virtually no benefit over ones of 40, 37 or any other
> size. There's some extra grobble in the fetcher, but instruction
> streams are (a) streams and (b) read-only. If you allow code
> to modify the next few instructions, then things are different,
> but nobody sane allows that nowadays.

Except, of course, for Intel and AMD x86.

Andy "Krazy" Glew

Nov 4, 2010, 2:24:26 AM

nm...@cam.ac.uk

Nov 4, 2010, 4:16:44 AM
In article <m76dnfht3Ye9gE_R...@giganews.com>,

I did say "sane". And, when I last looked, "allow" was rather
over-stating the case - though I accept that they don't actually
block it in the way that saner designs do.


Regards,
Nick Maclaren.

Quadibloc

Nov 4, 2010, 7:15:50 AM
On Nov 2, 11:40 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> No need to get fancy.


>
> Just fetch ordinary cache lines.

But then sometimes the fetch of a single instruction word will trigger
the fetch of two cache lines instead of one. Bad, bad, bad!

Well, maybe not too bad, given that instructions are fetched
*sequentially* instead of *randomly*.

So it's only the rare jump that might trigger the fetch of two cache
lines.

But that's why, when I wanted to design an architecture that could
connect to power-of-two modern memory, and yet handle 36-bit or 60-bit
floats, I *did* illustrate how one could get fancy, in complete
detail.

Fetch memory in one 256-bit block at a time. Put it in cache lines
that contain 16 such blocks.

But if the machine is emulating a Control Data 6600 with 60-bit
floats? Fill a cache line with only 15 blocks. Put it through circuits
that turn every 15 64-bit units of storage into 16 60-bit units of
storage before passing the data on to the ALUs.

Or an IBM 7094 with 36-bit floats? Then fill the cache lines with 9
blocks, and choose another set of wires.

I show where the conversion layer has to be put, in relation to the
scatter-gather circuits that handle parcelling out data in 8-bit, 16-
bit, 32-bit, 64-bit, and so on, chunks to the ALU so that separate
scatter-gather isn't needed for the non-standard word lengths, over on
my web site, at

http://www.quadibloc.com/arch/ar0102.htm

John Savard

Terje Mathisen

Nov 4, 2010, 8:49:33 AM
Andy "Krazy" Glew wrote:
> I think the IBM folk have proposed a fixed length 40-bit ISA that treats
> alll of *virtual* memory as an array. No overlapping instructions.
>
> (I may be projecting, because that's what the funny fixed length ISAs I
> have been involved in did.)

This is almost certainly true, due to the instruction # branch target
encoding, i.e. branch to 12345 means branch to the 40-bit opcode that
resides at virtual address 12345*5.

> I.e.
> virtual address bytes 0-4 are instruction 0
> 5-9 -> instruction 1
> ...
> 5n-5n+4 -> instruction n
> ...
> you are NOT allowed to have misaligned, overlapping, instructions,
> e.g. in virtual address bytes 2,3,4,5,6

You cannot have that, within a given virtual address map: Opcodes can
only reside at modulo 5 byte offsets.

> This leaves virtual address page mapping as the only way to get
> overlapping instructions. E.g. the same page might be mapped at both
> address 0 and 4K, meaning that bytes 0,1,2,3,4 of the page are an
> instruction when mapped at virtual address 0,
> and byte 0,1,2,3 form an instruction whose first byte is byte 4095 of
> the page mapped at 0, when the page is mapped at virtual address 4K (4096).
>

The solution is most probably "Don't do that!"


>
> To eliminate that source of overlapping instructions, one can imagine
> performing the mapping only with a page - say a 4K page.
>
> e.g. instruction number in page =
> (Address mod 4K) mod 5
> 0-4 -> 0
> 5-9 -> 1
> ...
> 4090-4094 -> 817
> byte 4091 is reserved, and cannot be used as part of an instruction.

No, this is horribly bad, since you have to do a mod 4095 on top of the
mul_by_5 operation, or div/mod 819 (not 817) first and then the mul.

div/mod 4095 is faster than 819 though.


>
> You have to modify the logic to increment instructions sequentially a
> bit. Plus, an instruction address tends to want to look like
> (VirtualAddressPageNumber,NumberOfInstructionInPage)

The real solution is quite simple:

All code space pages are colored so as to split them into 5 groups,
based on their mod 5 starting offset, and you cannot move code pages
from one group to another when remapping something.

This simply means that you'll load all programs and libraries to start
on a group zero (i.e. mod 4096*5 == 0) page, and then allocate more
pages as needed.

For the problems this machine is designed to handle, it really doesn't
matter if the code space is a little bit harder to set up optimally: The
programs will consist of far more data than code.
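
A tiny sketch of that coloring rule in C (4 KiB pages as assumed
throughout the thread; the check itself is just one way to state it):

#include <stdint.h>
#include <stdbool.h>

/* A code page's color is its starting virtual address mod 5.  With 4 KiB
 * pages, 4096 mod 5 == 1, so consecutive pages cycle through the five
 * colors, and a program loaded at a multiple of 5*4096 starts on color 0. */
static unsigned page_color(uint64_t page_va)
{
    return (unsigned)(page_va % 5);
}

/* A remap keeps every instruction on a multiple-of-5 byte boundary only
 * if the old and new virtual addresses have the same color. */
static bool remap_preserves_alignment(uint64_t old_va, uint64_t new_va)
{
    return page_color(old_va) == page_color(new_va);
}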

Terje Mathisen

Nov 4, 2010, 8:52:21 AM
Not really: Yes, it still works, but at such a horrible performance cost
that no speed-critical code uses it anymore.

This means that as long as you have a way to determine if/when it
happens, you can use the same restart mechanism as you need for a branch
mispredict: In this case you have mispredicted the actual code bytes. :-)

nm...@cam.ac.uk

Nov 4, 2010, 9:14:48 AM
In article <6tmaq7-...@ntp.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Andy "Krazy" Glew wrote:
>> On 11/3/2010 1:08 PM, nm...@cam.ac.uk wrote:
>>
>>> However, my response was saying that instruction sizes of 32 or
>>> 64 convey virtually no benefit over ones of 40, 37 or any other
>>> size. There's some extra grobble in the fetcher, but instruction
>>> streams are (a) streams and (b) read-only. If you allow code
>>> to modify the next few instructions, then things are different,
>>> but nobody sane allows that nowadays.
>>
>> Except, of course, for Intel and AMDx86.
>>
>Not really: Yes, it still works, but at such a horrible performance cost
>that no speed-critical code uses it anymore.
>
>This means that as long as you have a way to determine if/when it
>happens, you can use the same restart mechanism as you need for a branch
>mispredict: In this case you have mispredicted the actual code bytes. :-)

I am disinclined to waste time studying that morass again, as I
regard anyone that relies on it as being demented beyond hope,
but I believe that what you say is a bit over-optimistic :-)

In particular, I wasn't certain if instruction retry was ever
required, and about the interactions with the ECC and TLB miss
handling. That was the point at which the older systems that
supported it stopped doing so :-)

That might SEEM to be a problem only when you overwrite the current
instruction, but can be a problem on any implementation that delays
actual writes beyond the instruction boundary (as many do).

But the worse problem was the interaction with parallelism (not
just multiple CPUs, but the floating-point units). My vague
recollection of what I saw was that it was effectively undefined.


Regards,
Nick Maclaren.

Andy "Krazy" Glew

Nov 4, 2010, 11:32:43 AM
On 11/4/2010 5:49 AM, Terje Mathisen wrote:
> Andy "Krazy" Glew wrote:

>> To eliminate that source of overlapping instructions, one can imagine
>> performing the mapping only with a page - say a 4K page.
>>
>> e.g. instruction number in page =
>> (Address mod 4K) mod 5
>> 0-4 -> 0
>> 5-9 -> 1
>> ...

>> 4090-4094 -> 819
>> byte 4095 is reserved, and cannot be used as part of an instruction.


>
> No, this is horribly bad, since you have to do a mod 4095 on top of the
> mul_by_5 operation, or div/mod 819 (not 817) first and then the mul.

Typos. Arithmetic-os


> div/mod 4095 is faster than 819 though.
>>
>> You have to modify the logic to increment instructions sequentially a
>> bit. Plus, an instruction address tends to want to look like
>> (VirtualAddressPageNumber,NumberOfInstructionInPage)

Juxtapose:

"No, this is horribly bad, since you have to do a mod 4095 on top of the

mul_by_5 operation, or div/mod 819 first and then the mul."

"An instruction address tends to want to look like
(VirtualAddressPageNumber,NumberOfInstructionInPage)"

I.e. the latter, which came first, shows that the former, which came
last, need not be the case.

i.e. you don't need to do modulo 4095 or 819 on any interesting case.

Although I agree with you that I doubt that IBM is doing this. They are
just making ByteAddress = InstructionNumber*5 (maybe plus some offset).

Quadibloc

Nov 4, 2010, 12:05:31 PM
On Nov 4, 6:49 am, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

> You cannot have that, within a given virtual address map: Opcodes can
> only reside at modulo 5 byte offsets.

If you do that, and this means "modulo 5 byte offsets in physical
memory", then since pages are in a power-of-two size, you can't swap
pages unless you have modulo-5 page locations.

Of course, the idea of skipping the 4096th word of every 4096 words
might help.

But my cache idea, while still not allowing overlapping instructions,
puts the mod 5 hardware in the L2 - (L1, ALU) link, and so one can
still settle on an arbitrary mod 5 beginning for any part of main
memory, even if one has to stick to one's choice while it's paged in
unless one wants to cache it twice.

John Savard

Andy "Krazy" Glew

Nov 4, 2010, 5:47:36 PM
On 11/4/2010 9:05 AM, Quadibloc wrote:
> On Nov 4, 6:49 am, Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>
>> You cannot have that, within a given virtual address map: Opcodes can
>> only reside at modulo 5 byte offsets.
>
> If you do that, and this means "modulo 5 byte offsets in physical
> memory", then since pages are in a power-of-two size, you can't swap
> pages unless you have modulo-5 page locations.

??

You could execute 1 page at a time, so long as you don't need the
instructions at beginning and end of page.

You can handle page crossing instructions with a bit of buffering, or by
having two pages, since no instruction will cross more than two pages.

There's nothing that says that 5 byte instructions need any particular
number of pages.

> Of course, the idea of skipping the 4096th word of every 4096 words
> might help.

Thanks. I thought it was cute at the time.


>
> But my cache idea, while still not allowing overlapping instructions,
> puts the mod 5 hardware in the L2 - (L1, ALU) link, and so one can
> still settle on an arbitrary mod 5 beginning for any part of main
> memory, even if one has to stick to one's choice while it's paged in
> unless one wants to cache it twice.

I don't like the idea of depending on the cache.

One of Glew's Rules for Computer Architecture: your instruction set
should be able to work even if any or all of the caches are disabled.

Makes debugging easier.

George Neuner

Nov 5, 2010, 12:19:31 AM
On Thu, 4 Nov 2010 04:15:50 -0700 (PDT), Quadibloc <jsa...@ecn.ab.ca>
wrote:

>On Nov 2, 11:40 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
>wrote:
>
>> No need to get fancy.
>>
>> Just fetch ordinary cache lines.
>
>But then sometimes the fetch of a single instruction word will trigger
>the fetch of two cache lines instead of one. Bad, bad, bad!
>
>Well, maybe not too bad, given that instructions are fetched
>*sequentially* instead of *randomly*.
>
>So it's only the rare jump that might trigger the fetch of two cache
>lines.

Worst case is that the jump instruction straddles 2 cache lines and
targets a similarly straddling instruction. You have 3 lines fetched,
with 2 of them being almost completely wasted.

But you have a similar case if the jump is at the beginning of a line
and targets an instruction at the end of a line. This fetches only 1
additional line to execute the jump, but essentially still wastes 2
lines. Plus, regardless of what the target instruction does, a 3rd
(presumably useful) line will be immediately fetched. Moreover,
fixing the size and/or alignment of instructions can't prevent this
situation ... although it's possible that a different method of code
caching might.

George

George Neuner

Nov 5, 2010, 1:58:03 AM


I expect to get jumped on but I have to ask ... this is, after all, a
supercomputer - why not just make memory 40-bits wide and leave
addressing alone? Make instruction access read/write all 40 bits and
data access read/write only 32.

DSPs use this trick to handle mixed-size accesses. They don't have
VMM of course, but if code and data addressing are symmetric, I can't
think of a reason wider memory would cause any problem other than the
paging mechanism would have to access the memory as instructions to
make sure all the bits are copied. The backing page on disk is 25%
larger, but who cares?

I used a DSP (ADSP-21061) that had 48-bit instructions and 32 and
40-bit data. Its memory was laid out so that each pair of 48-bit
instruction words was addressable as three 32-bit data words. It also
had 40-bit data that was addressed as an instruction, but the data bus was
not connected to the extra memory bits.

I understand why it is desirable to save memory by dual addressing,
but as you already noted, there isn't any nice LCM page size that
accommodates both 5-byte instructions and multiple-of-8 byte data.
It's simple enough to require that the last (partial) instruction word
on a standard 4KB page be unused, but why bother? If you insist on
dual addressing and try to overlay such disparately sized items, ISTM
it will be difficult to detect and deal with address and/or page
aliasing should there ever be any. It would be better to just say a
page is 4K addresses and leave it at that.

Ducking for cover ...
George

Del Cecchi

Nov 5, 2010, 11:03:48 PM

"George Neuner" <gneu...@comcast.net> wrote in message
news:dc27d6hrmpceqlfjt...@4ax.com...

I am sure the memory access on this machine is way wider than 32 bits.
Like 256 plus ECC of a fancy sort. Package code and stuff. Hmm, I
wonder if one could use some of the ECC bits when instructions are
stored? Don't have to be correcting errors in instructions since the
right data is out there on the disk somewhere, right? (This idea
goes back a long way, as I recall)

George Neuner

Nov 6, 2010, 2:26:56 AM
On Fri, 5 Nov 2010 22:03:48 -0500, "Del Cecchi" <delc...@gmail.com>
wrote:

>"George Neuner" <gneu...@comcast.net> wrote in message
>>

>> I expect to get jumped on but I have to ask ... this is, after all,
>> a supercomputer - why not just make memory 40-bits wide and leave
>> addressing alone? Make instruction access read/write all 40 bits
>> and data access read/write only 32.
>

>I am sure the memory access on this machine is way wider than 32 bits.
>Like 256 plus ECC of a fancy sort.

No doubt ... and I expect the data to be 64-bit (or more). But my
original question remains. Multiple word access can be done
regardless of word size. The internal code and data buses are of
different width already ... my point is simply that external buses
likewise do not need any symmetry.

As a software systems person, I foresee significant problems in trying
to mix VMM with dual addressing when the words are so disparately
sized. As I wrote previously, I have used VLIW chips with a 2:3 ratio
of instructions to data ... but those chips did not have VMM and this
IBM chip has at best a 5:8 ratio (assuming 64-bit data) which will
make code addressing so different from data addressing that alias
detection will, I think, be difficult.

So having seen first hand that, by accepting a little waste, one can
make quite reasonable dual use of wide memory, I am asking why it
should not be considered.

George

Quadibloc

Nov 6, 2010, 10:34:10 PM
On Nov 4, 11:58 pm, George Neuner <gneun...@comcast.net> wrote:

> I expect to get jumped on but I have to ask ... this is, after all, a
> supercomputer - why not just make memory 40-bits wide and leave
> addressing alone?  Make instruction access read/write all 40 bits and
> data access read/write only 32.

Makes sense to me, but I can understand why they didn't do that. The
cost benefits of using only commodity parts as far as possible were
just too attractive to pass up. And you want to have only one pathway
between main memory and the swap drive.

John Savard

Quadibloc

Nov 6, 2010, 10:36:21 PM
On Nov 4, 10:19 pm, George Neuner <gneun...@comcast.net> wrote:

> But you have a similar case if the jump is at the beginning of a line
> and targets an instruction at the end of a line.

That's a good point. There is still a difference in terms of the
latency of the branch target, but that's insignificant compared to,
say, the costs of a cache miss.

John Savard

Scott Michel

Nov 6, 2010, 11:18:23 PM

Brett Davis wrote:
>
> This is feather in my cap for predicting Cell style architectures
> are the future.

I'm not sure of how much victory you want to claim. To quote a noted
game developer, Cell is a brain f---. The complexity of S/W dev is
probably the reason for Cell's dismal record. The lack of anything
Cell at the last Supercomputing is a good indicator that it's defunct.

Of course, if IBM is selling to gov't, there are the Roadrunner
developers one could tap (though, my understanding is that
many have moved on to other institutions).

Speaking of LLVM, I should probably recommend retiring the SPU
backend. Haven't worked on it in more than a year and my employer
retired the Bladecenter space heater a few months ago. Yeah, that
backend was a productive experiment...

Scott Michel

Nov 6, 2010, 11:36:30 PM

Michael S

Nov 7, 2010, 6:37:39 AM
On Nov 6, 8:26 am, George Neuner <gneun...@comcast.net> wrote:
>
> So having seen first hand that, by accepting a little waste, one can
> make quite reasonable dual use of wide memory, I am asking why it
> should not be considered.
>
> George

Maybe because, as explained above by Andy and Terje, the problem
you are trying to solve simply does not exist?

Michael S

Nov 7, 2010, 6:51:17 AM
On Nov 4, 5:32 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:
>

> Although I agree  with you that I doubt that IBM is doing this. They are
> just making  ByteAddress = InstructionNumber*5  (maybe  plus some offset).

IMHO even more likely that they don't try to save log2(5)=2.32 bits at
all, i.e. instruction pointer and associated instruction fields
(except, probably, B-form conditional branch instructions, where the
space in the BD field of the instruction word is relatively precious)
contain full byte address rather than Instruction Number.

Niels Jørgen Kruse

Nov 7, 2010, 6:50:41 AM
Michael S <already...@yahoo.com> wrote:

Considering that their slides contradict you, I think it unlikely.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Quadibloc

Nov 7, 2010, 8:31:04 AM
On Nov 6, 8:18 pm, Scott Michel <scooter....@gmail.com> wrote:

> I'm not sure of how much victory you want to claim.

Of course, you could both be right.

Cell-style designs might be very inefficient - a CPU that is cut off
from main memory, and which can only work with its own limited
internal memory, indeed, is much harder to get to do useful work.

But they could still be the wave of the future - because as we put
more and more transistors on a chip, the ratio between CPU power and
cache size on the chip, on the one hand, and bandwidth to talk to
external memory on the outside on the other, may just keep increasing.

Just as we have N cores on a chip instead of one core that is N times
faster - not because that's *better*, but because that's what we're
*stuck with*, so the future could well belong to designs like the Cell
for similar reasons.

John Savard

Terje Mathisen

Nov 7, 2010, 9:29:48 AM

I think this is silly:

The code fetch unit obviously reads in full cache lines, and it must be
at least double-buffered, so that it can load the next sequential line
while the rest of the pipeline is using the instructions in the previous
line, right?

This means that for sequential code, a 5-byte instruction that straddles
two cache lines is more or less a non-issue, and for a mis-predicted
branch, the fact that you sometimes have to fetch two code lines to get
the first instruction is a _very_ minor issue compared to the branch
miss itself.

I.e. this machine will do perfectly OK with a normal memory layout, and
pretty much all the mul_by_5 circuitry isolated to the fetch unit. (It
will probably maintain a shadow instruction pointer which counts bytes?)

For JIT type code the compiler might sometimes have to convert from byte
to instruction offset, i.e. div by 5, but this is simply a reciprocal
multiplication.
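
For the record, that divide is the standard unsigned magic-number
reciprocal multiplication; for 32-bit byte offsets it looks like this
(nothing here is specific to this machine):

#include <stdint.h>

/* Byte offset -> instruction number without a hardware divide:
 * for any 32-bit n, n/5 == (n * 0xCCCCCCCD) >> 34 when the product is
 * formed in 64 bits (the usual magic constant for unsigned /5). */
static inline uint32_t byte_to_insn(uint32_t byte_off)
{
    return (uint32_t)(((uint64_t)byte_off * 0xCCCCCCCDull) >> 34);
}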

Scott Michel

Nov 7, 2010, 3:08:25 PM
(Normally I reply in-line.)

You all just don't get it: the design is slick, but if you're relying
on a software dev to juggle the explicit memory hierarchy management,
then you've got an EPIC FAIL for current commercial languages.

I don't see a major shift in languages, just incremental improvements.
Intel's latest round of tools make it somewhat better for the chip SMP
developer, but hybrid m-core is still an unsolved problem. CUDA and
its standard OpenCL make life almost bearable for the GPGPU
developer. There's less memory hierarchy management. Problem
partitioning is still up to the developer, who, unlike an algorithm,
tends to be suboptimal.

nm...@cam.ac.uk

Nov 7, 2010, 3:24:48 PM
In article <bbd79ff7-a6d7-465e...@z17g2000prz.googlegroups.com>,

Scott Michel <scoot...@gmail.com> wrote:
>
>You all just don't get it: the design is slick, but if you're relying
>on a software dev to juggle the explicit memory hierarchy management,
>then you've got an EPIC FAIL for current commercial languages.

You just don't get it: if you are relying on hardware development
to produce ANY significant performance improvement without major
changes to current commercial languages then you've got an EPIC FAIL.

Your point is good, but one-sided. We are on the horns of a
dilemma.


Regards,
Nick Maclaren.

Andy "Krazy" Glew

Nov 7, 2010, 9:25:58 PM

Actually, the paper said that the instruction pointer contained an
instruction number, not a byte address.

I doubt the motivation was saving bits.

I suspect the motivation may have been to avoid the possibility of
overlapping instructions.

I might have been tempted to have a byte IP, but enforce it being a
multiple of 5. And use relative instruction numbers in branches.

Terje Mathisen

Nov 8, 2010, 4:54:05 AM
Andy "Krazy" Glew wrote:
> Actually, the paper said that the instruction pointer contained an
> instruction number, not a byte address.
>
> I doubt the motivation was saving bits.
>
> I suspect the motivation may have been to avoid the possibility of
> overlapping instructions.

As we've written previously, this only works within a single VM mapping,
unless they also employ the 5-way code page coloring I suggested.


>
> I might have been tempted to have a byte IP, but enforce it being a
> multiple of 5. And use relative instruction numbers in branches.

Cycle-by-cycle only the instruction fetch unit needs the byte IP, right?

The choice comes down to either converting instr. nr to byte offset on
every fetch, or maintaining a shadow byte IP pointer which gets
incremented by 5 on every regular instruction and reloaded after each
taken branch.

The right choice is probably the one that uses the least power.

Having relative instruction numbers in branches is not just obvious, it
seems like the only sane way of doing it.

If you don't convert to byte addr on every fetch, then it might be a win
to have the branch prediction table contain a byte IP as well as the
instr. nr.

Terje

PS. It might be possible that they use a base address for code, so that
to get from instruction number to byte address you multiply by 5 and add
in the code base?
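
A toy C model of the shadow-pointer option (the state layout and the
code-base detail are assumptions for illustration; a real front end
would of course do this in hardware):

#include <stdint.h>

struct fetch_state {
    uint64_t insn_ip;   /* architectural IP, counts instructions */
    uint64_t byte_ip;   /* shadow pointer, counts bytes */
    uint64_t code_base; /* optional base address for the code segment */
};

/* Common path: no multiply, just bump both counters. */
static void step_sequential(struct fetch_state *fs)
{
    fs->insn_ip += 1;
    fs->byte_ip += 5;
}

/* Only a taken branch pays for the x5 (shift-add) conversion. */
static void take_branch(struct fetch_state *fs, uint64_t target_insn)
{
    fs->insn_ip = target_insn;
    fs->byte_ip = fs->code_base + ((target_insn << 2) + target_insn);
}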

George Neuner

Nov 8, 2010, 12:47:07 PM

??? Both Andy and Terje remarked on the problems caused by disparate
dual addressing in the same memory space. AFAIHS, the only thing
either of them said relevant to VMM was the possibility to forgo using
the last word of a page for code (presuming 4KB pages to begin with).

George.

Scott Michel

Nov 8, 2010, 12:51:21 PM
On Nov 7, 12:24 pm, n...@cam.ac.uk wrote:
> In article <bbd79ff7-a6d7-465e-903f-7ed653666...@z17g2000prz.googlegroups.com>,

> Scott Michel  <scooter....@gmail.com> wrote:
>
>
>
> >You all just don't get it: the design is slick, but if you're relying
> >on a software dev to juggle the explicit memory hierarchy management,
> >then you've got an EPIC FAIL for current commercial languages.
>
> You just don't get it: if you are relying on hardware development
> to produce ANY significant performance improvement without major
> changes to current commercial languages then you've got an EPIC FAIL.
>
> Your point is good, but one-sided.  We are on the horns of a
> dilemma.

Nick:

You're excellently arguing my point. I happen to firmly believe that
functional languages are the way to go because they balance expressing
the algorithm with allowing the compiler to analyze data flow. I don't
think that the industry will go back and improve Common Lisp, inasmuch
as I'm still a fan of that language; Haskell seems to be the language
of choice for the functional programmer these days. However, the
academic nomenclature embedded in the language hampers its adoption.


-scooter

nm...@cam.ac.uk

Nov 8, 2010, 1:14:51 PM
In article <d353d9a1-e44d-4fe0...@o15g2000prh.googlegroups.com>,

Scott Michel <scoot...@gmail.com> wrote:
>
>You're excellently arguing my point. I happen to firmly believe that
>functional languages are the way to go because they balance expressing
>the algorithm with allowing the compiler to analyze data flow. I don't
>think that the industry will go back and improve Common Lisp, inasmuch
>as I'm still a fan of that language; Haskell seems to be the language
>of choice for the functional programmer these days. However, the
>academic nomenclature embedded in the language hampers its adoption.

That is truly a blast from the past! That is EXACTLY what I was
being told by 'computer scientists' in 1975.

While I and most people agree that more functionality (in that sense)
is good, for many reasons, what you say is regrettably the converse
of the known facts. By 1980, we knew that functional languages were
(a) not powerful enough to handle certain critical requirements,
(b) did not help compilers and other automatic tools to extract
parallelism and (c) these restrictions were mathematically
fundamental.

I know that the definition of "functional programming languages"
has been changed to allow limited object update, which means that
the extended class is rather better at (a), but it positively
doesn't help with (b). We need a different class of language for
parallel performance.

Personally, I favour dataflow, which has been sadly neglected, but
I don't see a hope in hell of that being accepted in my lifetime.


Regards,
Nick Maclaren.

George Neuner

Nov 8, 2010, 1:35:58 PM
On Sat, 6 Nov 2010 19:34:10 -0700 (PDT), Quadibloc <jsa...@ecn.ab.ca>
wrote:

>On Nov 4, 11:58 pm, George Neuner <gneun...@comcast.net> wrote:

I mentioned 32/40-bit as an example and I (ass)(u)(me)d everyone here
would understand it was *just* an example. I am well aware that the
machine likely would be built on OTS 64-bit wide parts.

You don't need custom parts, just some creative wiring. The idea is
to split off the odd bytes and store them in a separate memory bank
(or banks). Each OTS 64-bit part delivers 8 bytes per access, but 10
bytes are needed for code. So theoretically, one more 64-bit part
could store 4 sets of 2 extra code bytes. But in reality that 5th
part would become a bottleneck due to back-to-back access, so a real
system doing this would use at least 2 extra parts and alternate them
based on which (main) bank was accessed. Since 2 parts can hold 8
pairs of extra bytes, you'd also double to 8 main banks.
[And expand from there in multiples of 8 banks in 10 parts.]

Anyway, I get that this is an intellectual exercise for which nobody
here cares. I expected to be soundly trounced ... instead there was
barely any response at all.
George

Michael S

Nov 8, 2010, 2:26:19 PM
On Nov 8, 7:47 pm, George Neuner <gneun...@comcast.net> wrote:
> On Sun, 7 Nov 2010 03:37:39 -0800 (PST), Michael S
>

No. Read Terje's response: "The solution is most probably "Don't do
that!""
Just like with VL ISAs - technically, nothing prevents the user from
jumping into the middle of an instruction, except that the user
normally has no good reason for doing it.


nm...@cam.ac.uk

Nov 8, 2010, 3:15:57 PM
In article <1730222f-9e0c-4692...@v20g2000yqb.googlegroups.com>,

Michael S <already...@yahoo.com> wrote:
>
>Just like with VL ISAs - technically, nothing prevents user from
>jumping into the middle of instruction, except that user normally has
>no good reasons for doing it.

Ah! That takes me back .... You have 102 bytes free - just how
much functionality can you fit into that? Instruction sequences
that did two things according to where you entered etc. I never
actually wrote any code that executed the second half of an
instruction, though I did use instructions as data :-)


Regards,
Nick Maclaren.

MitchAlsup

Nov 8, 2010, 5:44:48 PM
This 40-bit instruction cache seems like a really straightforward
thing to do.

Arrange that the cache is 8 SRAM (or any larger power of 2) 32-bit
blocks wide (or any larger multiple (64,...) you desire).
If you want 4 instructions, simply read 5 SRAM blocks,...
Increment the IP by the number of instructions read.
If you don't want instructions spanning a 4096 page boundary, build a
(2**12-1) detector on the instruction pointer and use it to increment
the IP to the next 4096 boundary.
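
A bit-level sketch of that 5-blocks-for-4-instructions read in C
(little-endian packing within the 160-bit group is assumed purely for
illustration):

#include <stdint.h>

/* Extract instruction k (0..3) from a group of four 40-bit instructions
 * packed into five 32-bit SRAM words. */
static uint64_t extract_insn(const uint32_t w[5], unsigned k)
{
    unsigned bit  = k * 40;        /* bit offset of instruction k */
    unsigned word = bit / 32;
    unsigned sh   = bit % 32;
    uint64_t pair = ((uint64_t)w[word + 1] << 32) | w[word];
    return (pair >> sh) & ((1ull << 40) - 1);
}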

Mitch

Andrew Reilly

Nov 8, 2010, 6:23:04 PM
On Mon, 08 Nov 2010 18:14:51 +0000, nmm1 wrote:

> While I and most people agree that more functionality (in that sense) is
> good, for many reasons, what you say is regrettably the converse of the
> known facts. By 1980, we knew that functional languages were (a) not
> powerful enough to handle certain critical requirements, (b) did not
> help compilers and other automatic tools to extract parallelism and (c)
> these restrictions were mathematically fundamental.
>
> I know that the definition of "functional programming languages" has
> been changed to allow limited object update, which means that the
> extended class is rather better at (a), but it positively doesn't help
> with (b). We need a different class of language for parallel
> performance.

The languages themselves might not help very much with automatic (b), but
it is certainly seeming to be the case that much of the referential
transparency that comes from a functional style of coding (whatever the
language) is helping a lot with some human-inserted parallelism. (See
work with futures in scheme, and similar design choices.) Remains to be
seen how/if that scales out to mega-node machines, but it seems to be a
move in the right direction.

Some of the ideas in Fortress seemed to be heading in the right direction
too, although I doubt that's making much headway at the moment.

IBM had a language in the DARPA productivity project, too. No doubt they
have their own ideas about these sorts of issues.

When you're building funky cell-shaped scale-out processor hardware, I
think that it goes without saying that you don't give a fig about legacy
code bases, and probably could care less about legacy languages. How
much that limits your project's chances of success is surely a matter of
your project management skills.

Cheers,

--
Andrew

Robert Myers

Nov 8, 2010, 6:37:17 PM
On Nov 8, 1:14 pm, n...@cam.ac.uk wrote:

>
> Personally, I favour dataflow, which has been sadly neglected, but
> I don't see a hope in hell of that being accepted in my lifetime.
>

Don't people write functional dataflow programs every day, only they
call them spreadsheets?

Robert.

nm...@cam.ac.uk

Nov 8, 2010, 6:47:31 PM
In article <ebb7a041-0c6d-440b...@o15g2000prh.googlegroups.com>,

Robert Myers <rbmye...@gmail.com> wrote:
>>
>> Personally, I favour dataflow, which has been sadly neglected, but
>> I don't see a hope in hell of that being accepted in my lifetime.
>
>Don't people write functional dataflow programs every day, only they
>call them spreadsheets?

For very weak meanings of the words "functional", "dataflow" and
"programs", yes.

I am, however, talking about proper, Turing-complete programming
languages, and programming the applications that people currently
use explicitly parallel C++ or Fortran for. Current spreadsheets
just don't cut the mustard.


Regards,
Nick Maclaren.

Robert Myers

Nov 8, 2010, 9:03:33 PM
On Nov 8, 6:47 pm, n...@cam.ac.uk wrote:
> In article <ebb7a041-0c6d-440b-a4da-7c32acb05...@o15g2000prh.googlegroups.com>,

> Robert Myers  <rbmyers...@gmail.com> wrote:
>
>
>
> >> Personally, I favour dataflow, which has been sadly neglected, but
> >> I don't see a hope in hell of that being accepted in my lifetime.
>
> >Don't people write functional dataflow programs every day, only they
> >call them spreadsheets?
>
> For very weak meanings of the words "functional", "dataflow" and
> "programs", yes.
>
> I am, however, talking about proper, Turing-complete programming
> languages, and programming the applications that people currently
> use explicitly parallel C++ or Fortran for.  Current spreadsheets
> just don't cut the mustard.
>

It was a throwaway comment, Nick.

Still, given the success of the model, however flawed, and the time
wasted on experiments that seemed designed to do little more than to
satisfy the arbitrary prejudices of some academic, I guess I have to
say I'm surprised that no one has even played with it.

Robert.

Jason Riedy

Nov 8, 2010, 11:07:20 PM
And Andrew Reilly writes:
> IBM had a language in the DARPA productivity project, too. No doubt they
> have their own ideas about these sorts of issues.

It's X10 ( http://x10-lang.org ), and not in the slightest functional.
The scoped synchronization primitives and intentional asynchronous /
weakened consistency model are interesting, however.

In many ways, X10 *wants* to be Haskell but lacks the courage of an
academic's conviction (aka tenure). There is nifty hardware out there
for which Haskell would be a perfect fit, but their funding folks aren't
interested.

Json

Terje Mathisen

Nov 9, 2010, 1:48:01 AM

The opposite is even easier: back in the 486 days I wrote several
(speed-optimized) functions which gained a cycle or two by using the
immediate part of one instruction as the branch target of the core loop.

The reason for doing it was that the optimal inner loop code was
somewhat skewed, so that the first iteration needed to jump past the top
of it: Instead of taken branch I used a CMP instruction where the
immediate data was really the initial instruction(s).

Terje

Terje Mathisen

unread,
Nov 9, 2010, 1:59:51 AM11/9/10
to
MitchAlsup wrote:
> This 40-bit instruction cache seems like a really straightforward
> thing to do.
>
> Arrange that the cache is 8 SRAM (or any larger power of 2) 32-bit
> blocks wide (or any larger multiple (64,...) you desire).
> If you want 4 instructions, simply read 5 SRAM blocks,...
> Increment the IP by the number of instructs read.

This makes perfect sense.

> If you don't want instructions spanning a 4096 page boundary, build a
> (2**12-1) detector on the instruction pointer and use it to increment
> the IP to the next 4096 boundary.

But here I think you're wrong: Even if it is easy to do, it is far
easier to have a tiny code prefetch fifo which totally hides those cache
line boundaries.

If I simply _had_ to avoid block straddles, I would not use special hw
to detect such addresses, since that would make conversion from
instruction nr to byte address much harder:

I would probably prefer to dedicate a single 8-bit instruction prefix to
NOP instructions, and then simply throw away any one instruction slot
for each straddling boundary, i.e. 4 times in 4096 instructions/20480
bytes for a total overhead of less than 0.1%.

This would be forward compatible with any increase in page size, and if
you in the future had a cpu version with that fetch fifo, it would still
run old code while a new compiler could skip the NOP padding of "sharp
corners".

Terje Mathisen

unread,
Nov 9, 2010, 2:02:17 AM11/9/10
to
George Neuner wrote:
> On Sun, 7 Nov 2010 03:37:39 -0800 (PST), Michael S
>> May be, because, as explained above by Andy and Terje, the problem,
>> you are trying to solve, simply does not exist?
>
> ??? Both Andy and Terje remarked on the problems caused by disparate
> dual addressing in the same memory space. AFAIHS, the only thing
> either of them said relevant to VMM was the possibility to forgo using
> the last word of a page for code (presuming 4KB pages to begin with).

Rather the opposite: Andy suggested that as a possibility, and I said
that I'm sure they will instead allocate code pages in groups of 5, so
that each program/module is loaded at offset zero into a page.

This is easy, cheap and efficient, as well as close to zero overhead for
a machine designed to run a very small number (normally just one) of
programs simultaneously.

Andrew Reilly

unread,
Nov 9, 2010, 2:28:43 AM11/9/10
to
On Mon, 08 Nov 2010 23:07:20 -0500, Jason Riedy wrote:

> And Andrew Reilly writes:
>> IBM had a language in the DARPA productivity project, too. No doubt
>> they have their own ideas about these sorts of issues.
>
> It's X10 ( http://x10-lang.org ), and not in the slightest functional.
> The scoped synchronization primitives and intentional asynchronous /
> weakened consistency model are interesting, however.

Thanks for the link reminder. I never got as far as looking into X10.
Fortress is quite functional, which isn't too surprising.

> In many ways, X10 *wants* to be Haskell but lacks the courage of an
> academic's conviction (aka tenure). There is nifty hardware out there
> for which Haskell would be a perfect fit, but their funding folks aren't
> interested.

Haskell has a lot going for it, I think. Still a lot of gotchas too, of
course. Can't have everything, all at once.

Cheers,

--
Andrew

nm...@cam.ac.uk

unread,
Nov 9, 2010, 3:39:16 AM11/9/10
to
In article <8jsbhb...@mid.individual.net>,

Andrew Reilly <areil...@bigpond.net.au> wrote:
>On Mon, 08 Nov 2010 23:07:20 -0500, Jason Riedy wrote:
>
>> And Andrew Reilly writes:
>>> IBM had a language in the DARPA productivity project, too. No doubt
>>> they have their own ideas about these sorts of issues.
>>
>> It's X10 ( http://x10-lang.org ), and not in the slightest functional.
>> The scoped synchronization primitives and intentional asynchronous /
>> weakened consistency model are interesting, however.
>
>Thanks for the link reminder. I never got as far as looking into X10.
>Fortress is quite functional, which isn't too surprising.

I looked at those, and found Fortress and Chapel very boring; while
they weren't bad languages, it was had to see why they were expected
to be any more parallelisable than (say) Fortran. X10 was definitely
innovative but, upon reading it, I have my doubts about its claims
of generality, its usability and (slightly) its safety. However,
please note 'doubts' - I may be misjudging it, and mean to try it
out if I ever get time and an opportunity.


Regards,
Nick Maclaren.

nm...@cam.ac.uk

unread,
Nov 9, 2010, 3:45:14 AM11/9/10
to
In article <80874929-a5f3-4453...@j2g2000yqf.googlegroups.com>,
Robert Myers <rbmye...@gmail.com> wrote:
>
> [ dataflow languages ]

>
>It was a throwaway comment, Nick.

Consider it thrown away :-)

>Still, given the success of the model, however flawed, and the time
>wasted on experiments that seemed designed to do little more than to
>satisfy the arbitrary prejudices of some academic, I guess I have to
>say I'm surprised that no one has even played with it.

Oh, they have, but not much more than played with it. X10, for
example, can be said to have some aspects of dataflow in it. I can't
remember which now, but it has been used on a couple of specialist
languages (including a simulation one, if I recall). But the
mainstream has stayed solidly imperative or (sort-of) functional.


Regards,
Nick Maclaren.

MitchAlsup

unread,
Nov 9, 2010, 9:52:21 AM11/9/10
to
On Nov 9, 12:59 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:

> MitchAlsup wrote:
> > This 40-bit instruction cache seems like a really straightforward
> > thing to do.
>
> > Arrange that the cache is 8 SRAM (or any larger power of 2) 32-bit
> > blocks wide (or any larger multiple (64,...) you desire).
> > If you want 4 instructions, simply read 5 SRAM blocks,...
> > Increment the IP by the number of instructs read.
>
> This makes perfect sense.
>
> > If you don't want instructions spanning a 4096 page boundary, build a
> > (2**12-1) detector on the instruction pointer and use it to increment
> > the IP to the next 4096 boundary.
>
> But here I think you're wrong: Even if it is easy to do, it is far
> easier to have tiny code prefetch fifo which totally hides those cache
> line boundaries.

I will agree with you, here.

> If I simply _had_ to avoid block straddles, I would not use special hw
> to detect such addresses, since that would make conversion from
> instruction nr to byte address much harder:

Did you realize that you have defined 'harder' as: 3*4-input NANDs and
1*3-input NOR wired to the carry-in of an incrementer?
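
In C terms, the detector being priced out here is roughly this (a sketch
of the idea only, assuming the layout with 819 five-byte instructions
and one spare byte per 4 KB page):

#include <stdint.h>

/* If the low 12 bits of the IP are all ones, the IP is sitting on the single
   unused byte at the end of a 4 KB page (819 * 5 = 4095 bytes of code), so
   bump it to the next page boundary. */
uint64_t skip_page_pad(uint64_t ip)
{
    if ((ip & 0xFFFu) == 0xFFFu)   /* the (2**12 - 1) detector */
        ip += 1;
    return ip;
}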

Mitch

Michael S

unread,
Nov 9, 2010, 11:03:05 AM11/9/10
to

So how are you going to define the binary content of the BD field in
the B-form format?
Something like (x/819)*1024 + (x % 819)?
I'd rather not have ugly formulas like that as part of the ISA.

Andy "Krazy" Glew

unread,
Nov 9, 2010, 12:39:42 PM11/9/10
to

And they run them on dataflow processors every day - every modern
out-of-order processor is a dataflow machine internally.

Quadibloc

unread,
Nov 9, 2010, 12:41:01 PM11/9/10
to
On Nov 8, 11:35 am, George Neuner <gneun...@comcast.net> wrote:
> The idea is
> to split off the odd bytes and store them in a separate memory bank
> (or banks).

It's true that for a supercomputer, having to buy memory sticks in
sets of five would not be anything people would whine about.

John Savard

Terje Mathisen

unread,
Nov 9, 2010, 1:10:49 PM11/9/10
to

Sure!

The hard problem is the need to modify every relative branch offset:

Instead of simply doing byte_offset = instr_offset * 5, you must now
also determine how many pages the instruction offset will skip over:

target_byte_ip = current_byte_ip + instr_offset * 5;
/* Count 4 KB boundaries stepped over, treating the one spare byte at the end
   of each page as belonging to the next page (hence the +1 terms): */
pages_crossed = ((target_byte_ip + 1) >> 12) - ((current_byte_ip + 1) >> 12);
target_byte_ip += pages_crossed;
/* The spare bytes themselves may push the target across one more boundary: */
if ((((target_byte_ip + 1) >> 12) - ((current_byte_ip + 1) >> 12))
        > pages_crossed) {
    target_byte_ip++;
}

There are obviously faster ways to do this, but it is still an expensive
operation: You _can_ solve it by having both ip and byte_ip stored in
the branch predictor, but that's a lot of extra bits, particularly
compared with a fifo or even NOP padding.

Robert Myers

unread,
Nov 9, 2010, 4:25:41 PM11/9/10
to
On Nov 8, 1:14 pm, n...@cam.ac.uk wrote:

> We need a different class of language for
> parallel performance.

Here's another throwaway thought for you to throw away: suppose the
class of language you are imagining doesn't exist because it can't
exist for very fundamental reasons?

Sliding over into troll/flamebait land:

Two possibilities I see for "reality" right now:

1. Reality only appears to be non-local. Many independent, local,
embarrassingly-parallel processes create the appearance of non-
locality.

Possibility number one is the unspoken foundational assumption of
almost all currently "scalable," um, "supercomputers." It's the only
kind of high-speed parallel computation we really know how to do, and
it may be the only kind we will ever know how to do.

Everyone already knows what I think of conjecture #1 and the style of
computation it implies.

2. Reality has a long history of subtle, quantum mechanical (or
quantum mechanical-like) entanglements that we barely understand and
that have been evolving for billions of years.

Many, many developments in mathematical physics slide over a key step
that boils down to: "chaotic, non-linear systems forget the past and
become random." Suppose "chaotic," non-linear systems always remember
the past in really important ways?

"Reality," in this case, would mean practically any interesting system
we wanted to simulate.

If #2 is the correct view, you don't need massively parallel,
"scalable" computers. You need to compute for a very, very long time--
comparable to the age of the universe.

If #2 is the correct view, it is very bad news for HPC. If there is a
#3, I can't think of it.

Robert.

nm...@cam.ac.uk

unread,
Nov 9, 2010, 4:56:59 PM11/9/10
to
In article <9dc7523e-45c8-4ef7...@z17g2000prz.googlegroups.com>,

Robert Myers <rbmye...@gmail.com> wrote:
>
>> We need a different class of language for
>> parallel performance.
>
>Here's another throwaway thought for you to throw away: suppose the
>class of language you are imagining doesn't exist because it can't
>exist for very fundamental reasons?

That is possible. What I am imagining is certainly feasible; whether
it will help significantly with the problems I think that it will is
another matter entirely!

>Sliding over into troll/flamebait land:
>
>Two possibilities I see for "reality" right now:
>
>1. Reality only appears to be non-local. Many independent, local,
>embarrassingly-parallel processes create the appearance of non-
>locality.

Dubious, to put it mildly.

>2. Reality has a long history of subtle, quantum mechanical (or
>quantum mechanical-like) entanglements that we barely understand and
>that have been evolving for billions of years.

Ditto. There is little evidence that entanglements are spatially
non-local or temporally long-lived.

>Many, many developments in mathematical physics slide over a key step
>that boils down to: "chaotic, non-linear systems forget the past and
>become random." Suppose "chaotic," non-linear systems always remember
>the past in really important ways?

You've just turned the whole of modern science inside out, and
restored sympathetic magic to the forefront. OK. I can live with
that :-)

>If #2 is the correct view, you don't need massively parallel,
>"scalable" computers. You need to compute for a very, very long time--
>comparable to the age of the universe.

Nah. You need to move up a level - metamathematics, or perhaps
metamagic - SOP in pure mathematics, pity about the number of
people capable of doing it (and I am not one).


Regards,
Nick Maclaren.

Robert Myers

unread,
Nov 9, 2010, 6:04:13 PM11/9/10
to
On Nov 9, 4:56 pm, n...@cam.ac.uk wrote:
> In article <9dc7523e-45c8-4ef7-9619-fe817a8d6...@z17g2000prz.googlegroups.com>,

Maybe it's just more complicated mathematics than people are prepared
to deal with at the moment. Call it magic if you like. Some of the
more pedestrian aspects of this magic already have names: kurtosis,
cumulants, non-stationarity, non-ergodicity. Maybe I could sum it up
as the extremely limited relevance of the law of large numbers.

The fact that we don't (or that most people don't) know what to do
with math not taught to almost anyone doesn't mean it isn't
important. The relevance to computation is that enormous computers
have led to the dangerous illusion that, if you just do enough of the
same naive thing, you'll eventually get a meaningful answer. A
technological culture that believes in that kind of magic will end up
believing it can create wealth by slicing garbage dumps into tranches.

Please accept my apologies. Be warned: an invasion of programming
language speculation into a hardware group will set me off on this
kind of tangent. You can't compensate for ignorance by inventing ever
more elaborate computer science games.

Robert.


Jason Riedy

unread,
Nov 9, 2010, 6:42:54 PM11/9/10
to
And nm...@cam.ac.uk writes:
> I looked at those, and found Fortress and Chapel very boring; [...]

Fortress seems more about the programmer interface than the machine
one. I'm not against that focus, but I agree that Fortress only seems
to apply to solved problems.

Chapel is, um, a little more interesting. I'm not sure how much,
though. The current focus is on supporting heterogeneous computing
directly. Chapel seems to permit parallel code, but I'm not sure how
much it *helps* when creating it.

> X10 was definitely innovative but, upon reading it, I have my doubts
> about its claims of generality, its usability and (slightly) its
> safety.

HA! You need to read about X10 to doubt a perfect trade-off? ;)

It appears that X10 is sound. The proofs I've managed to read have
appeared correct to me. Now, those proofs place enough restrictions on
the program to push back on your other two points. But you already know
that consequence. X10 is pushing on the type system from below.
Definitely more interesting than I previously believed.

And I think Simulink is one of the data-flow simulators you remember.
Moved MathWorks's products out of the academic setting and into actual
use. Others include widely used data analysis packages (e.g. IBM's Data
Explorer, Khoros/Cantata) and multimedia frameworks (e.g. gstreamer).
Data-flow succeeds well as a way to couple pieces. The issue is just
the granularity of the pieces...

Jason

(full disclosure: I've received some X10-related funding, so X10 has had
more of my attention lately. And there is Chapel work through CASS-MT,
although I'm not involved in that task.)

nm...@cam.ac.uk

unread,
Nov 9, 2010, 6:55:57 PM11/9/10
to
In article <87aalia...@NaN.sparse.dyndns.org>,

Jason Riedy <ja...@acm.org> wrote:
>
>HA! You need to read about X10 to doubt a perfect trade-off? ;)

Not really :-)

>It appears that X10 is sound. The proofs I've managed to read have
>appeared correct to me. Now, those proofs place enough restrictions on
>the program to push back on your other two points. But you already know
>that consequence. X10 is pushing on the type system from below.
>Definitely more interesting than I previously believed.

Yup.

>And I think Simulink is one of the data-flow simulators you remember.

No, they were mostly the previous generation. 1970s.


Regards,
Nick Maclaren.

Andy "Krazy" Glew

unread,
Nov 9, 2010, 7:06:29 PM11/9/10
to

For that matter, some GPUs are also dataflow machines internally, but
operating on separate threads waiting on memory, rather than individual
instructions.

Brett Davis

unread,
Nov 9, 2010, 9:37:40 PM11/9/10
to
In article <befoq7-...@ntp.tmsw.no>,

A related issue is PC relative data loads...

I went down in flames when I suggested that PC relative loads might no
longer serve a useful purpose.

George Neuner

unread,
Nov 10, 2010, 1:18:41 PM11/10/10
to
On Mon, 8 Nov 2010 09:51:21 -0800 (PST), Scott Michel
<scoot...@gmail.com> wrote:

>On Nov 7, 12:24 pm, n...@cam.ac.uk wrote:
>> In article <bbd79ff7-a6d7-465e-903f-7ed653666...@z17g2000prz.googlegroups.com>,
>> Scott Michel <scooter....@gmail.com> wrote:
>>
>> >You all just don't get it: the design is slick, but if you're relying
>> >on a software dev to juggle the explicit memory hierarchy management,
>> >then you've got an EPIC FAIL for current commercial languages.
>>
>> You just don't get it: if you are relying on hardware development
>> to produce ANY significant performance improvement without major
>> changes to current commercial languages then you've got an EPIC FAIL.
>>
>> Your point is good, but one-sided. We are on the horns of a
>> dilemma.
>
>Nick:
>
>You're excellently arguing my point. I happen to firmly believe that
>functional languages are the way to go because they balance expressing
>the algorithm with allowing the compiler to analyze data flow. I don't
>think that the industry will go back and improve Common Lisp, inasmuch
>as I'm still a fan of that language; Haskell seems to be the language
>of choice for the functional programmer these days. However, the
>academic nomenclature embedded in the language hampers its adoption.
>
>-scooter

Lisp will forever support imperative programming - no one is willing
to change that. Moreover the most prevalent dialect - Common Lisp -
has so over-specified evaluation semantics that - apart from multiple
threads - it is guaranteed to remain essentially a serial language.

Pure functional code has latent parallelism at all levels, but
Haskell's call-by-need execution model limits the amount of coarse
grain (suitable for auto-threading) parallelism that can be extracted
from Haskell code. Unfortunately, Haskell's formal semantics are
somewhat bound to call-by-need execution - enough so that a more
aggressive execution model is not really possible.

IMO the best execution model for functional code (pure or not) is
speculative predicated memoized call-by-name operating over transacted
memory. This is essentially an active "actor" model that would enable
speculative execution of any code for which required preconditions -
program state, input ready, etc. - have been met. But it isn't
reasonable to blindly execute anything that is "ready" - I think that
as part of the preconditions, there needs to be some notion of being
on the "probable path" to limit speculation to things that are
somewhat likely to be executed anyway given the current state of the
program.
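
A toy, single-threaded C sketch of just the memoize-once-ready slice of
that model (an illustration only; predication, transacted memory and the
probable-path heuristic are all left out):

#include <stdbool.h>
#include <stdio.h>

typedef struct thunk {
    bool   ready;             /* preconditions met: inputs are available */
    bool   evaluated;         /* memo flag: result already computed      */
    double result;            /* the memoized value                      */
    double (*compute)(void);  /* the (pure) computation itself           */
} thunk;

/* A scheduler may call this speculatively as soon as t->ready is set;
   a later demand for the value then costs only the memo check. */
static double force(thunk *t)
{
    if (!t->evaluated) {
        t->result = t->compute();
        t->evaluated = true;
    }
    return t->result;
}

static double expensive(void) { return 42.0; }

int main(void)
{
    thunk t = { .ready = true, .evaluated = false, .compute = expensive };
    if (t.ready)
        force(&t);                 /* speculative evaluation */
    printf("%g\n", force(&t));     /* the real demand: already memoized */
    return 0;
}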

This, I think, is the best way to make use of the many-core machines
that are expected to shortly be forthcoming - that is unless some
process breakthrough suddenly puts serial execution rate increases
back on track ... which I really don't expect to happen any time soon.

I also share Nick Maclaren's fascination with dataflow, but at the
language level I think dataflow is a more radical paradigm shift for
current programmers than the shift to functional languages. However,
as an underlying execution model, dataflow is a very nice fit with the
speculative predicate model I outlined above - at least in theory
dataflow channels would provide an excellent auto-synchronizing
communication subsystem for the actors.

I think that's enough rambling for now 8-)
George

MitchAlsup

unread,
Nov 10, 2010, 2:47:42 PM11/10/10
to
On Nov 9, 6:06 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

And in both cases, the dataflow-ness is of the restricted kind--
restricted to look and smell like a vonNeumann machine after any
unfortunate "event".

Real DataFlow architectures do not even have the concept of a program-
counter/Instruction-pointer in-the-singular. Most of these have the
ability to express massive parallelism, only to immediately succumb to
the explosion in resource requirements caused by exposing the
inherent parallelism.

Plus, it's not quite clear how one deals, programmatically, with the
fact that time really is relative--that is, one cannot determine what
came first by reading a (ahem) reference clock and comparing readings.
It is also clear that when 1 million/billion threads are trying to
decide which one is first, the fact of needing to decide who is first
is an implausible model for real computations.

Mitch

nm...@cam.ac.uk

unread,
Nov 10, 2010, 2:53:36 PM11/10/10
to
In article <71279bbe-b0e7-498a...@j9g2000vbl.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>
>Real DataFlow architectures do not even have the concept of a program-
>counter/Instruction-pointer en-the-singular. Most of these have the
>ability to express massive parallelism, only to immediately sucumb to
>the explosion in resource requirements needed by the exposing the
>inherent parallelism.

Not necessarily, though I agree that considerable attention needs to
be paid to that area to stop it happening! And I don't mean just
in the execution - I mean in the constraints on what programs are
acceptable.

>Plus, its not quite clear how one deals, progamatically, with the fact
>that time really is relative--that is one can not determine what came
>first by reading a (ahem) reference clock and comparing readings. It
>is also clear that when 1 million/billion threads are trying to decide
>which one is first, that the fact of needing to decide who is first is
>an implausible model for real computations.

Well, yes. But that's life. Anyone who thinks that they can get
away with a single timeframe is living in the 20th century ....


Regards,
Nick Maclaren.

Robert Myers

unread,
Nov 10, 2010, 3:34:02 PM11/10/10
to
On Nov 10, 2:47 pm, MitchAlsup <MitchAl...@aol.com> wrote:

>
> Real DataFlow architectures do not even have the concept of a program-
> counter/Instruction-pointer en-the-singular. Most of these have the
> ability to express massive parallelism, only to immediately sucumb to
> the explosion in resource requirements needed by the exposing the
> inherent parallelism.
>

Getting a granular substance from a large reservoir to empty into a
finite channel is challenging. Maybe it's my time at UIUC that makes
me think of corn and soybeans. Everybody wants to be first, and
that's the problem.

The same problem reappears in so many different contexts. I encounter
it nearly every time I drive into Boston. New Hampshire is an
inexhaustible reservoir of "me first," maybe because they've already
come such a long way. The distribution of the location of the backups
(like the spinning disk drive) is sharply peaked, but the timing
isn't. In any case, no amount of concrete will make the problem
disappear.

> Plus, its not quite clear how one deals, progamatically, with the fact
> that time really is relative--that is one can not determine what came
> first by reading a (ahem) reference clock and comparing readings. It
> is also clear that when 1 million/billion threads are trying to decide
> which one is first, that the fact of needing to decide who is first is
> an implausible model for real computations.
>

Maybe we'll get it when we figure out how the brain solves its own
staggering non-simultaneity problem. The brain does do it, albeit
imperfectly, but it muddles through--somehow.

Robert.

EricP

unread,
Nov 11, 2010, 12:25:33 PM11/11/10
to
MitchAlsup wrote:
>
> Real DataFlow architectures do not even have the concept of a program-
> counter/Instruction-pointer en-the-singular. Most of these have the
> ability to express massive parallelism, only to immediately sucumb to
> the explosion in resource requirements needed by the exposing the
> inherent parallelism.
>
> Plus, its not quite clear how one deals, progamatically, with the fact
> that time really is relative--that is one can not determine what came
> first by reading a (ahem) reference clock and comparing readings. It
> is also clear that when 1 million/billion threads are trying to decide
> which one is first, that the fact of needing to decide who is first is
> an implausible model for real computations.
>
> Mitch

It would behave much like an analog computer -
when an input changes, the consequences ripple through the equations
and the system would "ring" for a while due to propagation delays
before settling down to a new stable state.

Feedback would oscillate so it is probably not what you want.
Without feedback, the system should eventually reach a new stable state.

Every item that has triggered is 'first'.
If there is no feedback, would the order in which items are processed matter
(ignoring floating point issues like rounding)?

Eric

EricP

unread,
Nov 11, 2010, 1:07:46 PM11/11/10
to
MitchAlsup wrote:
>
> Plus, its not quite clear how one deals, progamatically, with the fact
> that time really is relative--that is one can not determine what came
> first by reading a (ahem) reference clock and comparing readings. It
> is also clear that when 1 million/billion threads are trying to decide
> which one is first, that the fact of needing to decide who is first is
> an implausible model for real computations.
>
> Mitch

Though for efficiency's sake we would want some way to explicitly
control scheduling when we know multiple variables change together.
For example:

x = a + b + c

If we know that a, b and c all change together then there is no
point in evaluating x and propagating its change more than once.
So there would need to be some concept of "blocks"
of variables that are considered to all change together.
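
A small C sketch of that "block" notion (an illustration only, not a
concrete proposal): the inputs are updated as a unit, so x is
re-evaluated once per external event rather than once per changed input.

#include <stdio.h>

/* Inputs known to change together are grouped into one block. */
struct block { double a, b, c; };

static double x;   /* dependent value: x = a + b + c */

static void reevaluate(const struct block *blk)
{
    x = blk->a + blk->b + blk->c;
}

/* One external event updates the whole block, then fires the dependents once. */
static void update_block(struct block *blk, double a, double b, double c)
{
    blk->a = a;
    blk->b = b;
    blk->c = c;
    reevaluate(blk);
}

int main(void)
{
    struct block blk = { 0.0, 0.0, 0.0 };
    update_block(&blk, 1.0, 2.0, 3.0);   /* x computed once, not three times */
    printf("x = %g\n", x);
    return 0;
}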

Conditionals could be done something like:

if (foo < 0) d = 1;
x = (a + b + c) * d;

So d gets set only if (foo < 0), and the change in d
triggers the reevaluation of x.

Eric


Robert Myers

unread,
Nov 11, 2010, 2:35:40 PM11/11/10
to

Anyone have any idea where all these non-trivial parallel workloads
dying to be done in parallel actually reside?

Robert.

Andy "Krazy" Glew

unread,
Nov 11, 2010, 6:59:01 PM11/11/10
to
On 11/10/2010 12:34 PM, Robert Myers wrote:
> On Nov 10, 2:47 pm, MitchAlsup<MitchAl...@aol.com> wrote:
>
>>
>> Real DataFlow architectures do not even have the concept of a program-
>> counter/Instruction-pointer en-the-singular. Most of these have the
>> ability to express massive parallelism, only to immediately sucumb to
>> the explosion in resource requirements needed by the exposing the
>> inherent parallelism.
>>
>
> Getting a granular substance from a large reservoir to empty into a
> finite channel is challenging. Maybe it's my time at UIUC that makes
> me think of corn and soybeans. Everybody wants to be first, and
> that's the problem.

When I gave my MLP talk at the first WACI session at ASPLOS, Nikhil, one
of the MIT dataflow guys, a student of Arvind's, got up and said that the
advantage of micro-dataflow implementing the von Neumann model was that
the instruction pointer gave an indication of what computation had
higher priority.

> Maybe we'll get it when we figure out how the brain solves its own
> staggering non-simultaneity problem. The brain does do it, albeit
> imperfectly, but it muddles through--somehow.

I sometimes wonder if sequential consciousness is not a resource
prioritization trick in the brain, much as Nikhil talked about it being
for computation.

Robert Myers

unread,
Nov 11, 2010, 7:49:11 PM11/11/10
to
On Nov 11, 6:59 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> I sometimes wondffer if sequential consciousness is not a resource


> prioritization trick in the brain, much as Nikhil talked about it being
> for computation.

If only we understood the prefrontal cortex...

Of course, solving *that* problem would have implications far beyond
computer architecture.

Robert.

MitchAlsup

unread,
Nov 11, 2010, 9:57:05 PM11/11/10
to
On Nov 11, 5:59 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

I think the problem is much more fundamental.

The brain (or any analog system) computes what it is programmed to
compute based on the data it accepts at the beginning of the
computation and emits whatever the computation results happen to be to
the rest of the system. No real synchronization takes place, and lots
of "does this make sense" sideband calculations are used to qualify
the utility of the computation. Every data point is continuously
available to any computation consumer. Any data point can be updated
at any point in time, and the computation continues to converge to the
"right" result. The time-constant for the brain to converge needs to
be on-the-order of a blink of an eye, or certain "oh crap" high
priority interrupt go off and easier to perform computations given
precidence {inate and trained responses}.

That is: Continuous computation, without synchronization, with
(nearly) guarenteed convergence, and "save your but" safety
interrupts.

Procedural Programs do not behave like this. This is the vonNeumann
bottleneck wringing [y]our neck.

Mitch

Del Cecchi

unread,
Nov 11, 2010, 10:42:39 PM11/11/10
to

"EricP" <ThatWould...@thevillage.com> wrote in message
news:WGVCo.15651$Mk2....@newsfe13.iad...

Feedback does not necessarily cause or lead to oscillation. And I
don't know what you mean by "ringing" in an analog computer. If you
mean that it has a finite bandwidth, sure. That is normally implicit
in the differential equations being implemented.

Andrew Reilly

unread,
Nov 12, 2010, 2:35:38 AM11/12/10
to
On Thu, 11 Nov 2010 18:57:05 -0800, MitchAlsup wrote:

> That is: Continuous computation, without synchronization, with (nearly)
> guarenteed convergence, and "save your but" safety interrupts.

There is a modicum of synchronization: most sensory inputs seem to be
synchronised by real time. Computation doesn't get to run on until it
gets the right answer; the system copes with whatever state it has when
the next thing happens.

Cheers,

--
Andrew

Ken Hagan

unread,
Nov 12, 2010, 7:53:07 AM11/12/10
to
On Fri, 12 Nov 2010 07:35:38 -0000, Andrew Reilly
<areil...@bigpond.net.au> wrote:

> There is a modicum of synchronization: most sensory inputs seem to be
> synchronised by real time. Computation doesn't get to run on until it
> gets the right answer, the system copes with whatever state it has when
> the next thing happens.

Depending on what consciousness is, that could be a whole lot more than a
modicum. There is only one reality out there, so it is perhaps
understandable that our brain (normally) produces only one sense of
consciousness in response to it.

But I don't even know the problem that my brain solves. Maybe if I knew
that, I'd be able to guess how it solved it.

EricP

unread,
Nov 12, 2010, 12:32:50 PM11/12/10
to
Del Cecchi wrote:
> "EricP" <ThatWould...@thevillage.com> wrote in message
> news:WGVCo.15651$Mk2....@newsfe13.iad...
>> It would behave much like an analog computer -
>> when an input changes the consequences ripple through the equations
>> and the system would "ring" for a while due to propagation delays
>> before setting down to a new stable state.
>>
>> Feedback would oscillate so it is probably not what you want.
>> Without feedback, the system should eventually reach a new stable
>> state.
>>
>> Every item that has triggered is 'first'.
>> If there is no feedback, would the order items are processed matter
>> (ignoring floating point issues like rounding)?
>>
>> Eric
>>
>
> Feedback does not necessarily cause or lead to oscillation. And I
> don't know what you mean by "ringing" in an analog computer. If you
> mean that it has a finite bandwidth, sure. That is normally implicit
> in the differential equations being implemented.

Yes, it doesn't necessarily imply oscillation or growth to overflow,
and used judiciously it could be useful to iterate on a problem
until "done" - say a Newton-Raphson.
I just found it worth noting that a data flow architecture makes
accidental program oscillation a potential new class of bug.

By ringing I mean the value would bounce around before settling down
due to propagation delays along the various paths of the program.
Granted, in a simple analog program with a few 10's of calculations
such an effect would probably only be a nanosecond long.

But in a digital data flow program I think such a bouncing around
effect would be quite noticeable.

x = a + b + c

If an input change results in a, b, and c all changing but with different
propagation delays, then x could be calculated 3 times before it settles down
(depending on the different delays vs. the queue latency to recalculate x).
And each change to x will trigger its dependents to recalculate.

Eric

Robert Myers

unread,
Nov 12, 2010, 12:45:57 PM11/12/10
to
On Nov 12, 7:53 am, "Ken Hagan" <K.Ha...@thermoteknix.com> wrote:
> On Fri, 12 Nov 2010 07:35:38 -0000, Andrew Reilly  
>
> <areilly...@bigpond.net.au> wrote:
> > There is a modicum of synchronization: most sensory inputs seem to be
> > synchronised by real time.  Computation doesn't get to run on until it
> > gets the right answer, the system copes with whatever state it has when
> > the next thing happens.
>
> Depending on what consciousness is, that could be a whole lot more than a  
> modicum. There is only one reality out there, so it is perhaps  
> understandable that our brain (normally) produces only one sense of  
> consciousness in response to it.
>
So long as we are being careful about consciousness--a slippery
concept--it probably wouldn't hurt to be careful about reality, which,
as we experience it, is a mental construct and may "in reality" be all
kinds of weird things we haven't thought of (mentally constructed)
yet. I don't want to turn this into a thread of wild speculation, but
it would probably be safest to stick with "only one reality out there" as
a convenient working hypothesis, rather than as demonstrated fact (or
even as a statement with a definite meaning).

> But I don't even know the problem that my brain solves. Maybe if I knew  
> that, I'd be able to guess how it solved it.

Users of opioids or similar drugs that flood reward pathways with
dopamine usually aren't left with the feeling that there are any other
important problems left to solve. "Maximize dopamine release into
reward pathways" as a working goal for the brain is probably no more
dicey than "only one reality out there." Even if that were all there
were to the story, of course, it would only push the mystery back to
the ways in which the brain manipulates reward pathways without the
use of drugs.

What problem is an operating system solving, anyway?

I fully expect neurophysiology to get to the how of parallel
computation long before computer science does.

Robert.

Del Cecchi

unread,
Nov 12, 2010, 11:04:53 PM11/12/10
to

"EricP" <ThatWould...@thevillage.com> wrote in message
news:jUeDo.15775$AT2....@newsfe01.iad...

Analog computers aren't really like that. Typically there are
bandwidth considerations similar to propagation delays but there are
also integrators etc in addition to summers. It has been quite a
while since analog computer class and lab.
>


Terje Mathisen

unread,
Nov 13, 2010, 6:46:21 AM11/13/10
to
EricP wrote:
> Yes, it doesn't necessarily imply oscillation or growth to overflow,
> and used judiciously it could be useful to iterate on a problem
> until "done" - say a Newton-Raphson.
> I just found the idea that a data flow architecture implies accidental
> program oscillation as a potential new class of bug to be worth noting.
>
> By ringing I mean the value would bounce around before setting down
> due to propagation delays along the various paths of the program.

Ringing is _very_ common in NR iterations used for function evaluation;
this is one of the key reasons why you would rather have a fixed number
of iterations than a detector which notices that the algorithm is done.
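
For example, a software reciprocal done the fixed-count way (a generic
sketch, nobody's production code): the seed's worst-case error tells you
how many iterations are enough, so no "are we done yet?" test is needed.

#include <stdio.h>

/* 1/d for d in [0.5, 1) by Newton-Raphson.  The classic linear seed has a
   worst-case relative error of 1/17, so four quadratic iterations reach
   full double accuracy; the count is fixed and there is no convergence
   detector for any ringing to confuse. */
static double recip(double d)
{
    double x = 48.0 / 17.0 - (32.0 / 17.0) * d;   /* seed, error <= 1/17 */
    for (int i = 0; i < 4; i++)                   /* fixed count */
        x = x * (2.0 - d * x);                    /* NR step for f(x) = 1/x - d */
    return x;
}

int main(void)
{
    printf("1/0.75 ~= %.17g\n", recip(0.75));
    return 0;
}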

Quadibloc

unread,
Nov 16, 2010, 7:45:09 PM11/16/10
to
On Nov 9, 2:25 pm, Robert Myers <rbmyers...@gmail.com> wrote:

> Many, many developments in mathematical physics slide over a key step
> that boils down to: "chaotic, non-linear systems forget the past and
> become random."  Suppose "chaotic," non-linear systems always remember
> the past in really important ways?

Classical non-linear systems certainly do remember the past. It's just
that we can't measure the weather in enough detail to tell what the
effect of a butterfly flapping its wings will be, and so randomness
represents the best approximate simplified model that we can use in
practice.

The dream of predicting the future by knowing the position and
velocity of every particle at a given time to infinite precision...
just will not be realized. Our supercomputers will have plenty of work
to do without attempting the impossible.

Yes, a very fast sequential processor is superior to a lot of parallel
ones, but that is because it is more general in what it can do; how
the universe works physically isn't really the reason for this.

John Savard

Robert Myers

unread,
Nov 16, 2010, 8:52:00 PM11/16/10
to

That's nice. Could you possibly do me a favor and look at this
article

http://workspace.imperial.ac.uk/mathsinstitute/public/research/programmes/turbulence/marie_curie_chair/P4.pdf

which I promise you does not take you for an idiot and will not try to
tell you things you already know.

Robert.

Terje Mathisen

unread,
Nov 17, 2010, 2:50:39 AM11/17/10
to
Robert Myers wrote:
> That's nice. Could you possibly do me a favor and look at this
> article
>
> http://workspace.imperial.ac.uk/mathsinstitute/public/research/programmes/turbulence/marie_curie_chair/P4.pdf
>
> which I promise you does not take you for an idiot and will not try to
> tell you things you already know.

What I got from that paper is that many experiments seem to validate the
theory of ergodicity for turbulent flows, but only at particular scales.

The huge benefit, if it is true, is that it makes simulation almost
infinitely easier, right?

It is also known that this does break down when the scale changes, so
what happens on/near the boundary?

Quadibloc

unread,
Nov 17, 2010, 9:45:33 AM11/17/10
to
On Nov 16, 6:52 pm, Robert Myers <rbmyers...@gmail.com> wrote:

> That's nice.  Could you possibly do me a favor and look at this
> article

> which I promise you does not take you for an idiot and will not try to


> tell you things you already know.

The PowerPoint presentation on which the PDF is based... has somewhat
distracting typography. Of course, _that_ is irrelevant, but it is
going to affect the gentleman's chances of being taken seriously.

It certainly is true that some chaotic systems don't just end up in a
homogenous "random" state, but instead break symmetry in a dramatic
manner. The presentation gives bifurcated flows as an example; the
classic "butterfly effect" example speaks of a hurricane happening
here rather than there.

Thus, for a system to be ergodic is a much stricter condition than is
required for the conclusion that however desirable it might be to be
able to predict its future behavior, it is still impractical to do so.
When tiny causes lead to big effects, the relevant part of the system
where the tiny cause happened may be approximately ergodic for some
time before the apparently random results of that cause now influence
larger-scale events that are approximately deterministic. It is
precisely because the systems involved are _nonlinear_ that this
transition from an approximately ergodic domain to a not ergodic at
all domain can happen.

It seems to me, therefore, that the article you've cited is based on a
misconception, which is why it's attempting to take facts that
"everybody knows" and turn them into what would be in effect a
refutation of chaos theory.

We know that chaotic systems are not ergodic (if nothing interesting
could happen, they wouldn't be chaotic), but the reason that they are
classed as chaotic is that the nonlinearity has caused coupling
between the large-scale obviously non-ergodic domain, and the small-
scale domain that is so close to being ergodic that attempting to
predict the large-scale consequences of effects on that small scale
would be completely impractical.

We can predict the weather a day or two ahead... but that doesn't
prove that we can also predict it decades into the future. He seems to
be attempting to say, "See, you can predict tomorrow's weather!
Therefore, the atmosphere is not ergodic. Thus, predicting the weather
a decade from now is not necessarily impossible". I'm afraid that
chaos theorists are just going to look at him funny.

John Savard

nm...@cam.ac.uk

unread,
Nov 17, 2010, 11:59:37 AM11/17/10
to
In article <11817fe9-a947-41b7...@k11g2000vbf.googlegroups.com>,

Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>Thus, for a system to be ergodic is a much stricter condition than is
>required for the conclusion that however desirable it might be to be
>able to predict its future behavior, it is still impractical to do so.

That is very true.

>We know that chaotic systems are not ergodic (if nothing interesting
>could happen, they wouldn't be chaotic) but the reason that they are
>classed as chaotic is because the nonlinearity has caused coupling
>between the large-scale obviously non-ergodic domain, and the small-
>scale domain that is so close to being ergodic that attempting to
>predict the large-scale consequences of effects on that small scale
>would be completely impractical.

Er, in the absence of a decent mathematical definition of "chaotic",
I think that's also incorrect. There is no obvious reason why many
chaotic systems cannot be ergodic at sufficiently large scales (or
asymptotically ergodic, if you prefer). The weather is precisely
an example of where this is true (in the absence of global warming,
yada, yada, ....)

It isn't hard to produce examples of systems that are not ergodic
at small scales but are ergodic at larger ones! At this point, you
have to consider what the nature of time is in your model, and the
one thing that I know about continuous time Markov theory is that I
am not capable of understanding it :-)

However, I am not refuting the general thrust of your posting, which
makes sense to me. I haven't read that paper and don't intend to,
so shall not comment on it.


Regards,
Nick Maclaren.
