Initial question

William Pearson

Jul 18, 2010, 2:24:14 PM
to Future Computer Architecture - Technology
Hi,

I feel we are losing track of what I wanted to talk about. Imagine
memristors [1] work and that we can embed them into processors and get
persistent memory right on chip.

What kind of architecture would you build? Would it only be
programmable by another chip or would it be able to alter its own
state so that it could alter the program it instantiated?

There are many people working with networks of compute nodes that are
standard processors/cores with large amounts of off-CPU memory that
controls the program. I'm interested in things that are a little more
esoteric.

Will

[1] http://www.memristor.org/

Todd Hoff

Jul 18, 2010, 2:36:55 PM
to fc...@googlegroups.com
Something like APL or dataflow languages targeting a memristor substrate? In my mind I see arrays of data where each cell is connected to a series of transforms. The transforms can update in place or create new arrays. Each of these new arrays will have its own transforms that all run asynchronously. Transforms can be programmed on the fly. By copying the data and the code it seems plausible to have maximum parallelization. No locks and the CPU blocks can fire at will.
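
As a rough C++ sketch of that model (all names invented, with std::async standing in for the asynchronous CPU blocks rather than any real memristor runtime): every transform gets its own copy of the input array, so there is no shared mutable state and no locks.

#include <functional>
#include <future>
#include <vector>

using Array     = std::vector<double>;
using Transform = std::function<Array(Array)>;  // takes a copy, returns a new array

// Launch each transform against its own copy of the source array.
// No shared mutable state, so no locks: the blocks "fire at will".
std::vector<std::future<Array>> fire(const Array& src,
                                     const std::vector<Transform>& xforms) {
    std::vector<std::future<Array>> results;
    for (const Transform& t : xforms)
        results.push_back(std::async(std::launch::async, t, src));  // src copied per task
    return results;
}

int main() {
    Array data{1, 2, 3, 4};
    std::vector<Transform> xforms = {
        [](Array a) { for (double& x : a) x *= 2; return a; },  // "update in place" on its copy
        [](Array a) { a.push_back(0); return a; },              // create a new, larger array
    };
    for (auto& f : fire(data, xforms))
        f.get();  // each result array could grow transforms of its own
}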

J. Andrew Rogers

Jul 19, 2010, 4:15:55 AM
to fc...@googlegroups.com
On Sun, Jul 18, 2010 at 11:24 AM, William Pearson <wil.p...@gmail.com> wrote:

I feel we are losing track of what I wanted to talk about. Imagine
memristors [1] work and that we can embed them into processors and get
persistent memory right on chip.


How would memristor memory systems be materially different from massively multithreaded systems at the level of computational model? Software implementations on these systems frequently never go to proper persistent storage, have huge memories per core, and have extremely flat models. It is distributed, but that aspect is never visible from the code in the systems I am familiar with. The code has a vaguely functional feel.

 
What kind of architecture would you build? Would it only be
programmable by another chip or would it be able to alter its own
state so that it could alter the program it instantiated?


Honestly, I would not change too much from what is available now, maybe tweak some of the design elements and update certain aspects of the cores. Also, I might extend the integer instructions to support multidimensional integer operands. Some functional programming constructs in C++ would be nice, e.g. continuations.

Just about anything you can imagine is available now if you know where to shop. Not memristors specifically, but they are a relatively narrow (but cool) optimization. For a given silicon budget, everything is a tradeoff. Reconfigurable systems are interesting but are only narrowly competitive when faced with systems with a lot of different functional units connected by a fast fabric and a programmer who knows how to exploit them.


There are many people working with networks of compute nodes that are
standard processors/cores with large amounts of off-CPU memory that
controls the program. I'm interested in things that are a little more
esoteric.


Define "esoteric". Dynamically reconfigurable computing has its place, but it has practical drawbacks. I'm familiar with a number of systems that work like this and it is less compelling in practice than you might think.
 
--
J. Andrew Rogers

William Pearson

Jul 19, 2010, 7:24:19 AM
to fc...@googlegroups.com
On 19 July 2010 09:15, J. Andrew Rogers <jar.m...@gmail.com> wrote:
>
> On Sun, Jul 18, 2010 at 11:24 AM, William Pearson <wil.p...@gmail.com>
> wrote:
>>
>> I feel we are losing track of what I wanted to talk about. Imagine
>> memristors [1] work and that we can embed them into processors and get
>> persistent memory right on chip.
>
>
> How would memristor memory systems be materially different from massively
> multithreaded systems at the level of computational model? Software
> implementations on these systems frequently never go to proper persistent
> storage, have huge memories per core, and have extremely flat models. It is
> distributed, but that aspect is never visible from the code in the systems I
> am familiar with. The code has a vaguely functional feel.

It would be nice if you referenced things somewhat. The closest I can
think of to what you are describing is Pig,
http://hadoop.apache.org/pig/. Although that does tend to go to
persistent storage.

The differences I can see are:
1) You might want to be aware of how many (compute/memory) cells you
have left, so you can avoid having to swap around processes.
2) You wouldn't have to worry about cache in the same way.
http://lwn.net/Articles/252125/
3) As programs are persistent, you don't have to worry about the ELF
loader or equivalent getting taken over and trojaning all your code on
a reboot.
4) Real-time guarantees are easier to give.

I think we are coming at things from different directions. I'm
approaching this from a robotics/potentially hostile code
environment. You, I'm guessing, are coming from a clean and
controllable data centre/data processing point of view.

>>
>> What kind of architecture would you build? Would it only be
>> programmable by another chip or would it be able to alter its own
>> state so that it could alter the program it instantiated?
>
>
> Honestly, I would not change too much from what is available now, maybe
> tweak some of the design elements and update certain aspects of the cores.
> Also, I might extend the integer instructions to support multidimensional
> integer operands. Some functional programming constructs in C++ would be
> nice, e.g. continuations.

So we were lucky and found, early in the development of computing,
the processor architecture that will last us forever and under all
conditions? People will still be using something similar 100 or
1,000 years in the future?

> Just about anything you can imagine is available now if you know where to
> shop. Not memristors specifically, but they are a relatively narrow (but
> cool) optimization. For a given silicon budget, everything is a tradeoff.
> Reconfigurable systems are interesting but are only narrowly competitive
> when faced with systems with a lot of different functional units connected
> by a fast fabric and a programmer who knows how to exploit them.

Is that due to the fact that all our skills, knowledge and R&D are
concentrated in these types, or because they are intrinsically better?

>> There are many people working with networks of compute nodes that are
>> standard processors/cores with large amounts of off-CPU memory that
>> controls the program. I'm interested in things that are a little more
>> esoteric.
>
>
> Define "esoteric". Dynamically reconfigurable computing has its place, but
> it has practical drawbacks. I'm familiar with a number of systems that work
> like this and it is less compelling in practice than you might think.

Do our own neurons suffer from the same practical drawbacks? If not,
can we mimic what they do differently, while keeping the
understandability of silicon? I don't think the space of
reconfigurable computing has been fully explored, so I would like to
see what can be done.

Will

William Pearson

Jul 19, 2010, 7:44:56 AM
to fc...@googlegroups.com
On 18 July 2010 19:36, Todd Hoff <toddho...@gmail.com> wrote:
> Something like APL or dataflow languages targeting a memristor substrate? In
> my mind I see see arrays of data  where each cell is connected to a  series
> of  transforms. The transforms can update in place or create new arrays.
> Each of these new arrays will have their own transforms that all run
> asynchronously. Transforms can be programmed on the fly. By copying the data
> and the code it seems plausible to have maximum parallelization. No locks
> and the CPU blocks can fire at will.
>

I do like dataflow-style languages. They abstract away a bit too
much for my liking (at least in their current forms) for system-level
programming.

For example, how would you reference transforms in such a way that
they can be (re)programmed in a dataflow language?

Will

J. Andrew Rogers

Jul 19, 2010, 4:07:59 PM
to fc...@googlegroups.com
On Mon, Jul 19, 2010 at 4:24 AM, William Pearson <wil.p...@gmail.com> wrote:
It would be nice if you referenced things somewhat. The closest I can
think of to what you are describing is Pig,
http://hadoop.apache.org/pig/. Although that does tend to go to
persistent storage.


No, not like that. I'm thinking more along the lines of, e.g., the Cray XMT processors. Single processes are decomposed into thousands or millions of lightweight threads, some statically and some dynamically. There is no visible messaging; the memory model in the cluster is completely flat. There is no cache or cache locality at all, by design.

You do not "see" threads in XMT code, just a fairly vanilla UNIX process. There is no conventional synchronization (since no threads). Memory addresses have bit flags; if you really care, access can be qualified with a "test and set" macro based on the flag state. If you signal exclusion on a flag when you access a memory address, a blocked tasklet sleeps until the bit flag on the word is changed. Since task switching is single-cycle, access is atomic, and there is no spinning, synchronization is approximately free. If you can decompose a process into enough runnable tasklets, the cores are never waiting for anything even though tasks are being scattered hither and yon in the distributed system.
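As a rough software sketch of the full/empty-bit idea (invented names, loosely modeled on the XMT's write-when-empty/read-when-full word operations; emulated here with a spinning atomic for a single producer and consumer, whereas the real hardware sleeps the tasklet and switches in a single cycle):

#include <atomic>
#include <cstdint>
#include <thread>

// One memory word plus its "full" bit flag.
struct TaggedWord {
    std::atomic<bool> full{false};
    int64_t value = 0;  // handed off through the flag (single producer/consumer)

    // "writeef"-style: wait until the word is empty, fill it, mark it full.
    void write_when_empty(int64_t v) {
        while (full.load(std::memory_order_acquire))
            std::this_thread::yield();  // hardware would sleep the tasklet instead
        value = v;
        full.store(true, std::memory_order_release);
    }

    // "readfe"-style: wait until the word is full, drain it, mark it empty.
    int64_t read_when_full() {
        while (!full.load(std::memory_order_acquire))
            std::this_thread::yield();
        int64_t v = value;
        full.store(false, std::memory_order_release);
        return v;
    }
};

int main() {
    TaggedWord w;
    std::thread producer([&] { for (int64_t i = 0; i < 3; ++i) w.write_when_empty(i); });
    std::thread consumer([&] { for (int i = 0; i < 3; ++i) (void)w.read_when_full(); });
    producer.join();
    consumer.join();
}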

Basically, it is a CPU optimized for extremely fine-grained parallelism in a distributed memory environment at the expense of efficiency at cache-local sequential processing. There are some other new processor designs based on the same ideas (deep latency hiding). These architectures are extremely efficient for graph and data flow models relative to more conventional processors.

 
So we were lucky and found, early in the development of computing,
the processor architecture that will last us forever and under all
conditions? People will still be using something similar 100 or
1,000 years in the future?


Architectures are a pretty straightforward set of tradeoffs. Pick your tradeoffs and find the processor design that matches. There are myriad processor designs out there that assume all manner of use cases. You would be hard-pressed to come up with a use case for which a processor does not exist. You could argue about the mix-and-match of functional units, but that is a design detail.

Most *popular* architectures assume a high degree of cache locality. This is ideal for some codes, very bad for others. Intel's hyper-threading, to use that example, is a baby step toward latency-oblivious architectures like the XMT processor I mention above. Latency-oblivious architectures give poor performance for other types of codes (e.g. tight cache-local algorithms) so it is not all upside.

 
> Just about anything you can imagine is available now if you know where to
> shop. Not memristors specifically, but they are a relatively narrow (but
> cool) optimization. For a given silicon budget, everything is a tradeoff.
> Reconfigurable systems are interesting but are only narrowly competitive
> when faced with systems with a lot of different functional units connected
> by a fast fabric and a programmer who knows how to exploit them.

Is that due to the fact that all our skills, knowledge and R&D are
concentrated in these types, or because they are intrinsically better?


There is no "intrinsically better". It is just a set of tradeoffs. The phase space has been pretty thoroughly gone over multiple times.

 
Do our own neurons suffer from the same practical drawbacks? If not,
can we mimic what they do differently, while keeping the
understandability of silicon? I don't think the space of
reconfigurable computing has been fully explored, so I would like to
see what can be done.


Neurons are really just a graph-like computation model, which is why I mentioned pervasively parallel latency-oblivious CPUs above. With some of the weird topological computational models, you can even get adequate performance on conventional processors.

Reconfigurable computing is for when the actual bandwidth to the processing units significantly exceeds the throughput of a conventional CPU. That is the only real use case. Most interesting algorithms are latency-bound or at least are not limited by the throughput of the core pipeline per se, which means a conventional processor will work just as well. Or even better, use a conventional latency-oblivious processor if there is some exploitable parallelism.


--
J. Andrew Rogers

William Pearson

Jul 22, 2010, 6:11:14 AM
to fc...@googlegroups.com
On 19 July 2010 21:07, J. Andrew Rogers <jar.m...@gmail.com> wrote:
> On Mon, Jul 19, 2010 at 4:24 AM, William Pearson <wil.p...@gmail.com>
> wrote:
<snip interesting stuff>

> Basically, it is a CPU optimized for extremely fine-grained parallelism in a
> distributed memory environment at the expense of efficiency at cache-local
> sequential processing. There are some other new processor designs based on
> the same ideas (deep latency hiding). These architectures are extremely
> efficient for graph and data flow models relative to more conventional
> processors.

Thanks, interesting.

I'm a bit busy at the moment. I'll get back to you in a bit. Just to
say that I suspect my search for a better paradigm is mainly due to
my interest in security. I want to do things like avoiding single
points of failure. Yup, you can simulate that, but if you are always
using the same security system it would be better for it to be
embodied in the system architecture, for efficiency.

These things can be safely ignored by supercomputer designers for the
most part :) And so they add an extra variable in the phase space of
systems I am considering.

Will

J. Andrew Rogers

Jul 23, 2010, 4:13:01 AM
to fc...@googlegroups.com
On Thu, Jul 22, 2010 at 3:11 AM, William Pearson <wil.p...@gmail.com> wrote:
I'm a bit busy at the moment. I'll get back to you in a bit. Just to
say that I suspect my search for a better paradigm is mainly due to
my interest in security. I want to do things like avoiding single
points of failure. Yup, you can simulate that, but if you are always
using the same security system it would be better for it to be
embodied in the system architecture, for efficiency.

These things can be safely ignored by supercomputer designers for the
most part :) And so they add an extra variable in the phase space of
systems I am considering.
 

Eliminating single points of failure is a big deal with supercomputing systems. When you start pushing a million compute elements, failures tend to be routine. It is not a big problem; just select an appropriate level of redundancy and/or redo for the cluster.

Security is almost an orthogonal matter. Hardened network fabric technologies exist but they are not exactly low latency if you are using them as a distributed computing fabric. At the level of an individual compute node, security granularity is expensive and it depends on how decentralized you want your authority. Pervasively decentralized authority is a Hard Problem.

HBase-like architectures are coming into vogue (one process per core, no task switching or locks), so the idea is to put the security at the logical node/physical core level. That is almost, but not quite, an impedance match for relying on hardened network fabrics and whatever hardened kernels+hardware they can come up with to provide security. Inside these "nodes" it is typically a brute-force single-threaded process for maximum efficiency; it makes a certain amount of sense. You just need small enough nodes to retire operations at a reasonable rate.

--
J. Andrew Rogers

William Pearson

Jul 31, 2010, 6:04:00 AM
to fc...@googlegroups.com
Sorry for the late reply.

On 23 July 2010 09:13, J. Andrew Rogers <jar.m...@gmail.com> wrote:

> Eliminating single points of failure is a big deal with supercomputing
> systems. When you start pushing a million compute elements, failures tend to
> be routine. It is not a big problem; just select an appropriate level of
> redundancy and/or redo for the cluster.


I meant single point of failure in the sense that if a malicious bit
of code got control of a single resource, the system would go down or
be highly adversely affected. So no flat memory models, no
uninterruptible processes, no Ring 0.

> Security is almost an orthogonal matter. Hardened network fabric
> technologies exist but they are not exactly low latency if you are using
> them as a distributed computing fabric. At the level of an individual
> compute node, security granularity is expensive and it depends on how
> decentralized you want your authority. Pervasively decentralized authority
> is a Hard Problem.

What is your opinion on the old capability model?

http://en.wikipedia.org/wiki/Capability-based_security

It seems a good match for decentralized authority.
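
In miniature (a toy C++ sketch, all names invented), the appeal is
that a capability is an unforgeable token that both designates a
resource and carries the rights to it; delegation is just handing
over an attenuated copy:

#include <cstdio>
#include <memory>
#include <string>

enum Rights : unsigned { READ = 1, WRITE = 2 };

// The resource itself; reachable only through a capability.
struct File {
    std::string contents;
};

class FileCap {
    std::shared_ptr<File> file;  // unforgeable: no public way to mint one from a name
    unsigned rights;
    FileCap(std::shared_ptr<File> f, unsigned r) : file(std::move(f)), rights(r) {}
public:
    // Only the "kernel" mints a root capability.
    static FileCap mint(std::string contents) {
        auto f = std::make_shared<File>();
        f->contents = std::move(contents);
        return FileCap(std::move(f), READ | WRITE);
    }
    // Delegation by attenuation: hand out a copy with fewer rights.
    FileCap attenuate(unsigned keep) const { return FileCap(file, rights & keep); }

    bool read(std::string& out) const {
        if (!(rights & READ)) return false;
        out = file->contents;
        return true;
    }
    bool write(std::string s) const {
        if (!(rights & WRITE)) return false;
        file->contents = std::move(s);
        return true;
    }
};

int main() {
    FileCap root = FileCap::mint("hello");
    FileCap ro = root.attenuate(READ);  // read-only view for another subject
    std::string s;
    std::printf("read ok=%d\n", (int)ro.read(s));      // 1: holding the token is the authority
    std::printf("write ok=%d\n", (int)ro.write("x"));  // 0: no ambient authority to escalate
}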

Will

J. Andrew Rogers

Aug 1, 2010, 4:08:45 AM
to fc...@googlegroups.com
On Sat, Jul 31, 2010 at 3:04 AM, William Pearson <wil.p...@gmail.com> wrote:
I meant single point of failure in the sense that if a malicious bit
of code got control of a single resource, the system would go down or
be highly adversely affected. So no flat memory models, no
uninterruptible processes, no Ring 0.


I am not clear on what flat memory models have to do with it. The term "flat" is an observation about latency, not much else.


What is your opinion on the old capability model?

http://en.wikipedia.org/wiki/Capability-based_security

It seems a good match for decentralized authority.


Capabilities are great. The caveat is that they are expensive in a distributed environment and in the case of a pervasively decentralized network assume no bad actors.

A very, very big problem with pervasively decentralized authority is Nash equilibria. The strategies required for a pervasively decentralized authority to work are incredibly brittle; deviate from those strategies even a little bit and you will end up with pathological or effectively centralized behaviors in systems where the users may nonetheless be benign. Guaranteeing a stable and efficient equilibrium in a pervasively decentralized system is a problem we only know how to solve under very strict constraints that do not apply to many applications. All of the strategies are generalized variants of Marginal Cost Pricing strategies for transactions. These work reasonably well in simple systems.
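
To make "marginal cost pricing" concrete (a toy illustration; the quadratic cost curve and the loads are invented): each transaction is charged the extra system cost its presence imposes on everyone else.

#include <cstdio>
#include <numeric>
#include <vector>

// Invented congestion curve: total cost grows superlinearly with load.
double system_cost(double load) { return load * load; }

int main() {
    std::vector<double> loads = {1.0, 2.0, 3.0};  // per-user demand on a shared fabric
    double total = std::accumulate(loads.begin(), loads.end(), 0.0);
    for (double mine : loads) {
        // Marginal cost price: what the system costs with me minus without me.
        double price = system_cost(total) - system_cost(total - mine);
        std::printf("load %.1f pays %.1f\n", mine, price);
    }
    // With cost = load^2: payments are 11, 20, and 27 -- heavier users
    // internalize the extra congestion they cause, which is what keeps the
    // equilibrium honest (when the strict constraints hold).
}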

This is an extremely hard theoretical problem. As far as I know, it has never been solved either officially or unofficially by anyone. Truly decentralized authority is an incredibly difficult problem to solve; I am not certain it is even solvable. There may be a clever solution but it is much harder than some other problems. I studied the problem for a while and decided there were easier problems to attack. :-)

--
J. Andrew Rogers

William Pearson

Aug 1, 2010, 5:47:49 AM
to fc...@googlegroups.com
On 1 August 2010 09:08, J. Andrew Rogers <jar.m...@gmail.com> wrote:
> On Sat, Jul 31, 2010 at 3:04 AM, William Pearson <wil.p...@gmail.com>
> wrote:
>>
> I meant single point of failure in the sense that if a malicious bit
> of code got control of a single resource, the system would go down or
> be highly adversely affected. So no flat memory models, no
> uninterruptible processes, no Ring 0.
>
>
> I am not clear on what flat memory models have to do with it. The term
> "flat" is an observation about latency, not much else.

I was going with the Wikipedia definition:

"Flat memory model or linear memory model refers to a memory
addressing paradigm in low-level software design such that the CPU can
directly (and sequentially/linearly) address all of the available
memory locations without having to resort to any sort of memory
segmentation or paging schemes."

I suspect it is one of those things that has multiple usages.


>> What is your opinion on the old capability model?
>>
>> http://en.wikipedia.org/wiki/Capability-based_security
>>
>> It seems a good match for decentralized authority.
>
>
> Capabilities are great. The caveat is that they are expensive in a
> distributed environment and in the case of a pervasively decentralized
> network assume no bad actors.

Yeah, I don't tend to think too distributed (e.g. the capabilities
are still opaque to the system).

> A very, very big problem with pervasively decentralized authority is Nash
> equilibria. The strategies required for a pervasively decentralized
> authority to work are incredibly brittle; deviate from those strategies even
> a little bit and you will end up with pathological or effectively
> centralized behaviors in systems where the users may nonetheless be benign.
> Guaranteeing a stable and efficient equilibrium in a pervasively
> decentralized system is a problem we only know how to solve under very
> strict constraints that do not apply to many applications.

Interesting, I didn't know there was work along those lines. I'll have
to read more on the subject.

Will

J. Andrew Rogers

Aug 1, 2010, 1:05:10 PM
to fc...@googlegroups.com
On Sun, Aug 1, 2010 at 2:47 AM, William Pearson <wil.p...@gmail.com> wrote:
I was going with the Wikipedia definition:

"Flat memory model or linear memory model refers to a memory
addressing paradigm in low-level software design such that the CPU can
directly (and sequentially/linearly) address all of the available
memory locations without having to resort to any sort of memory
segmentation or paging schemes."

I suspect it is one of those things that has multiple usages.


It is a distinction of interface versus implementation. Segmentation can be implemented both visibly and invisibly at various software levels. If you ignore addressing, which can always be mapped to something "flat" in software, the flatness of memory is primarily about latency. When you access a memory address there are many things that can alter the apparent latency such as having been paged to disk or being attached to a non-local core in a NUMA system. In a flat memory model, every address is the same distance from the core from a software design standpoint.

This is particularly important in large shared memory systems. In cache-oblivious architectures, the address space really does feel flat which makes software design simpler. If the architecture was designed for cache-local codes (e.g. SGI Altix), the large variance in apparent memory distance makes for much more complicated code even though the address space is technically flat. It is like designing software around large memory-mapped files except that you have to deal with it for purely in-memory data structures. Good performance requires taking explicit control of memory layout behaviors. Clusters are somewhat easier to program for in this regard because the memory boundaries and distances are unambiguous -- neither you, nor the compiler, nor the operating system can mismanage it -- with the cost that you have to deal with the network explicitly.
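
To illustrate the interface-versus-implementation point (a toy C++ sketch, invented names): the code below presents one flat address space while the backing store is scattered across non-contiguous segments, which is exactly the sort of thing paging or NUMA placement does invisibly.

#include <cstddef>
#include <cstdint>
#include <vector>

class FlatSpace {
    static constexpr std::size_t SEG_BYTES = 4096;
    std::vector<std::vector<uint8_t>> segments;  // physically scattered backing store

public:
    explicit FlatSpace(std::size_t bytes)
        : segments((bytes + SEG_BYTES - 1) / SEG_BYTES,
                   std::vector<uint8_t>(SEG_BYTES)) {}

    // The interface is flat: one linear address. The *cost* of an access,
    // though, depends on which segment (page, NUMA node, ...) it lands in.
    uint8_t& at(std::size_t addr) {
        return segments[addr / SEG_BYTES][addr % SEG_BYTES];
    }
};

int main() {
    FlatSpace mem(1 << 20);   // a "flat" megabyte
    mem.at(0)      = 1;       // same interface...
    mem.at(999999) = 2;       // ...possibly a very different distance from the core
    return mem.at(0) + mem.at(999999);
}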

 
> A very, very big problem with pervasively decentralized authority is Nash
> equilibria. The strategies required for a pervasively decentralized
> authority to work are incredibly brittle; deviate from those strategies even
> a little bit and you will end up with pathological or effectively
> centralized behaviors in systems where the users may nonetheless be benign.
> Guaranteeing a stable and efficient equilibrium in a pervasively
> decentralized system is a problem we only know how to solve under very
> strict constraints that do not apply to many applications.

Interesting, I didn't know there was work along those lines. I'll have
to read more on the subject.


It is algorithmic game theory. I do not see much written about it in the software world, probably because the mathematics is pretty difficult.

I became exposed to this set of issues about three years ago while working on some international projects. One of the big problems for governments integrating their resources is that the authority has to be pervasively decentralized because they do not trust each other enough for a centralized authority to exist. If you really analyze these scenarios carefully, you quickly realize that every scheme has a centralized authority of some sort buried in it that is very difficult to eliminate while still retaining optimal characteristics for the overall system.

--
J. Andrew Rogers
