hardware support for Actors

Stephen Fuld

Jun 19, 2012, 5:17:06 PM

I recently read the book Seven Languages in Seven Weeks. In it, they
make the point several times that the actor model is superior to the
shared memory model for multi-threaded applications.

In recent threads in this group, we have discussed hardware support for
the primitives required for shared memory multi-threading. So the
question arises: is there something that could reasonably be done to
improve support for actor model programs?

I know that in the mid 1980s, Elxsi had a system with hardware
(actually microcode) instructions for message passing. Would something
like that make sense now? Could it, with appropriate software support,
make actor model programs more prevalent and reduce the issues related
to shared memory programs?

I expect that Actor models are not a panacea, but they do seem an
improvement. Is the improvement enough that with some HW support, they
could help things considerably?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Edward Feustel

Jun 21, 2012, 6:43:49 AM

On Tue, 19 Jun 2012 14:17:06 -0700, Stephen Fuld
<SF...@alumni.cmu.edu.invalid> wrote:

>I know that in the mid 1980s, Elxsi had a system with hardware
>(actually microcode) instructions for message passing. Would something
>like that make sense now? Could it, with appropriate software support,
>make actor model programs more prevalent and reduce the issues related
>to shared memory programs?
>

One might also want to look at the "simple mechanisms" for messaging
used on the Transputer.

Ed

MitchAlsup

Jun 21, 2012, 7:34:55 PM

to SF...@alumni.cmu.edu.invalid

I have been giving some thought to bringing back some support for base-and-bounds on top of paging (or first-byte/last-byte on top of paging) to enable
smaller-than-page-sized messages, or just messages on odd boundaries in the paging tables.

{You would have a limit of (say) 128 of these things at a time (and Your Own Memory Space would have to consume a few).}

What this means is that the virtual address would have to pass a base-and-bounds check before the address is translated through the page tables. Thus one could, in theory, pass a 2-byte message starting on the last byte of one page and ending on the first byte of a second page. The receiver of said message would have access to those two bytes, one on each page, but to no other bytes of either page. The OS/hypervisor would be involved in setting up the paging parts of the message passing.

With 64-bit virtual addresses, conventional 4KB pages, and 4KB page tables, you get 7 high-order address bits that could be used to address a 4-DoubleWord container. In this container would be a root pointer, a base bit pattern, a limit bit pattern, and a DW of other stuff to make the message efficient. So, in essence, you get 128 messages of (essentially) unbounded size, and a 57-bit virtual address space for each message. {Tricks are used to make the paging table overheads "reasonable".}
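
To make the bit arithmetic concrete, here is a rough software model of the check described above. It is only a sketch under assumptions that are not in the post: 8-byte page table entries (so 9 index bits per 4KB table), five table levels to reach 57 bits, an inclusive limit, and the hypothetical names Container, split and check. A real design would do this in hardware ahead of the table walk.

```python
from dataclasses import dataclass

PAGE_BITS = 12        # 4KB pages
INDEX_BITS = 9        # assumption: a 4KB table holds 512 eight-byte entries
LEVELS = 5            # assumption: depth needed to cover 57 bits
SPACE_BITS = PAGE_BITS + LEVELS * INDEX_BITS   # 12 + 5*9 = 57
SELECT_BITS = 64 - SPACE_BITS                  # 7 bits -> 128 containers


@dataclass
class Container:
    """The four doublewords named in the post (field names are hypothetical)."""
    root: int       # root pointer of this message's page tables
    base: int       # first accessible byte within the 57-bit space
    limit: int      # last accessible byte (treated as inclusive here)
    other: int = 0  # the "DW of other stuff"


def split(vaddr):
    """Return (container selector, offset within the 57-bit space)."""
    return vaddr >> SPACE_BITS, vaddr & ((1 << SPACE_BITS) - 1)


def check(containers, vaddr):
    """Base-and-bounds check performed before any page-table translation."""
    selector, offset = split(vaddr)
    c = containers.get(selector)
    if c is None or not (c.base <= offset <= c.limit):
        raise PermissionError(f"access to {vaddr:#x} is outside the message")
    return c.root, offset   # only now would the page-table walk start


# A 2-byte message: last byte of one page, first byte of the next.
containers = {3: Container(root=0x1000, base=0x1FFF, limit=0x2000)}
message_start = (3 << SPACE_BITS) | 0x1FFF
```

With this table, check(containers, message_start) and check(containers, message_start + 1) pass, while the bytes on either side of the message raise PermissionError even though they sit on the same two pages.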

Mitch

Andy (Super) Glew

Jun 22, 2012, 12:30:20 AM

This is morphing from the IMHO excessively specific "hardware support
for Actors" to the more general "hardware support for message passing".

I think Mitch is in the world of designing efficient message passing
support in a shared memory system - where processors that are on
separate address spaces in a shared memory system can pass messages
efficiently, by arranging

a) for pages to be remapped from one process into the other

or possibly

b) having such processes be SAS, Single Address Space, sharing addresses
but not necessarily having access to each others pages. I.e. pages
mapped, but not permitted.

Using the fine grain base+bounds protection to allow access to specific
parts of pages not otherwise allowed, parts passed by messages between
processes.

---

I'm okay with this. Indeed, I admit to having spent quite a lot of my
last decade looking at [lwb,upb) protection, a la Milo Martin et al's
HardBound.

---

However: non shared memory message passing systems are *also* here to
stay. They are more common than shared memory.

My agenda, therefore, is

a) to have message passing instruction set extensions - send/receive
(and possibly also PGAS-like, MPI 3.x-like, put, get, which are just
shared memory in another form).
I.e., yes, I propose instructions
put(channel, regval ) / get( channel, regval )
put( channel, buf, nbytes ) / get( channel, buf, nbytes )
+ possibly scatter/gather forms,
and/or RDMA forms
(and whatever is necessary to match MPI's matching semantics)

b) On non-shared memory but tightly bound systems, to interface these to
"channels" that match to the network switches' message passing

c) On loosely bound systems, to interface these to channels that talk
TCP/IP networking - ideally without OS intervention, apart from the
initial setting up of the channel, assuming you have TOE (TCP/IP Offload
Engines).
Or at least to have an efficient way of putting stuff into OS
networking queues without system calls, only invoking the OS if it needs
to be woken up.

d) On shared memory systems, to use

d.1) efficient shared virtual memory if suitable

d.1a) passing pointers if the addresses are already shared

d.1b) remapping pages - in hardware/microcode, without or with minimal
OS intervention, if not already shared.

d.2) copying, in hardware or microcode, if virtual memory tricks are not
usable - e.g. if not page sized and aligned, and that is required.
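
As a rough illustration of how the buffer forms of put/get in (a) might sit on top of the choices in (d), here is a small software model. It is a sketch only: the Channel class, its shared_memory flag and the queue inside it are hypothetical stand-ins for whatever software would set up, and Python functions stand in for the proposed instructions.

```python
from collections import deque


class Channel:
    """Hypothetical channel handle, assumed to be set up once by software."""

    def __init__(self, shared_memory):
        self.shared_memory = shared_memory
        self.queue = deque()


def put(channel, buf, nbytes):
    """Model of put(channel, buf, nbytes)."""
    view = memoryview(buf)[:nbytes]
    if channel.shared_memory:
        # d.1a: addresses already shared, so pass a reference (no copy)
        channel.queue.append(view)
    else:
        # d.2: virtual memory tricks not usable, so copy the payload
        channel.queue.append(bytes(view))


def get(channel, buf, nbytes):
    """Model of get(channel, buf, nbytes); returns the byte count delivered."""
    msg = channel.queue.popleft()
    n = min(nbytes, len(msg))
    buf[:n] = msg[:n]
    return n


def roundtrip(shared_memory):
    """Send b"hello", then scribble on the sender's buffer before the get."""
    ch = Channel(shared_memory)
    src = bytearray(b"hello")
    put(ch, src, len(src))
    src[0] = ord("J")              # sender reuses its buffer too early
    dst = bytearray(5)
    get(ch, dst, len(dst))
    return bytes(dst)
```

The roundtrip helper shows the usual price of the zero-copy path: in the shared case the receiver observes the sender's later write, in the copying case it does not. That ownership question is part of what the base+bounds protection discussed here would have to settle.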

The base+bounds stuff, aka "capabilities" stuff, is cool - but I think
is only the last stage in optimizing such a system.

Although it might be the first stage if you are thinking like a startup,
and not Intel.

Tim McCaffrey

Jun 25, 2012, 6:07:47 PM

In article <4FE3F4DC...@SPAM.comp-arch.net>, an...@SPAM.comp-arch.net
says...

>
>
>This is morphing from the IMHO excessively specific "hardware support
>for Actors" to the more general "hardware support for message passing".
>

My opinion:

Going out of your way to use shared memory for messages may still be a win
for smaller clusters, but probably doesn't scale as well for large clusters.
Back in the days of MB/s bandwidth to memory you really wanted to minimize
the number of moves. These days with 10+GB/s bandwidth to memory the
tradeoffs are different.

The most efficient way to implement message passing is architecting
everything to allow a "push" only paradigm. If you require any "reads" (e.g.
acks, I/O setup, test for completion, or other feedback) your performance is
badly affected. Note that this isn't easy, and it is one of those things
that can't be done halfway (or even 99%, one "read" can ruin your
performance).

I would think things that make message passing more efficient would be things
like explicitly handing off cache line ownership, efficiently scheduling
tasks to handle messages, and off loading of message routing, etc.
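
One way to picture the "push" only idea is a ring that lives in the receiver's memory, with flow-control credits pushed back into the sender's memory, so that neither side ever reads across the link. The following is just a single-threaded sketch under that assumption; PushLink and its fields are hypothetical names, and a real implementation would also need memory ordering guarantees that Python hides.

```python
class PushLink:
    """Single sender, single receiver; each side reads only its own memory."""

    def __init__(self, slots=4):
        self.slots = slots
        # Lives in the receiver's memory; the sender only writes it.
        self.receiver_ring = [None] * slots
        # Lives in the sender's memory; the receiver only writes it.
        # It holds the highest sequence number the sender may use, plus one.
        self.sender_credits = slots
        self.sent = 0        # sender-local count of messages pushed
        self.received = 0    # receiver-local count of messages consumed

    def send(self, payload):
        """Push one message; returns False if no credit is available."""
        if self.sent >= self.sender_credits:   # local read only
            return False
        # Sequence number and payload travel together in one remote write.
        self.receiver_ring[self.sent % self.slots] = (self.sent, payload)
        self.sent += 1
        return True

    def poll(self):
        """Receiver checks its own ring; returns a payload or None."""
        slot = self.receiver_ring[self.received % self.slots]  # local read
        if slot is None or slot[0] != self.received:
            return None                        # nothing new has arrived
        self.received += 1
        # Push a fresh credit to the sender instead of being read by it.
        self.sender_credits = self.received + self.slots
        return slot[1]
```

The sequence number stored with each payload is what lets the receiver detect arrival without asking the sender anything, and the credit write is the only traffic in the reverse direction.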

- Tim

Stephen Fuld

Jun 26, 2012, 1:58:44 AM

On 6/21/2012 9:30 PM, Andy (Super) Glew wrote:
> On 6/21/2012 4:34 PM, MitchAlsup wrote:
>> On Tuesday, June 19, 2012 4:17:06 PM UTC-5, Stephen Fuld wrote:

snip

> This is morphing from the IMHO excessively specific "hardware support
> for Actors" to the more general "hardware support for message passing".

I have no problem with that, as long as it doesn't get so far away that
simple actor models are not easily accommodated.

snip

> I'm okay with this. Indeed, I admit to having spent quite a lot of my
> last decade looking at [lwb,upb) protection, a la Milo Martin et al's
> HardBound.
>
> ---
>
> However: non shared memory message passing systems are *also* here to
> stay. They are more common than shared memory.
>
> My agenda, therefore, is
>
> a) to have message passing instruction set extensions - send/receive
> (and possibly also PGAS-like, MPI 3.x-like, put, get, which are just
> shared memory in another form).
> I.e., yes, I propose instructions
> put(channel, regval ) / get( channel, regval )
> put( channel, buf, nbytes ) / get( channel, buf, nbytes )
> + possibly scatter/gather forms,
> and/or RDMA forms
> (and whatever is necessary to match MPI's matching semantics)

I agree that message passing instructions are exactly the kind of thing
I was looking for. But I think it must be more than the instructions.
If you want the solution to be all in hardware, don't you need some
hardware data structures to ensure that someone doesn't flood an
unprepared process with a huge number of messages, or that the message
goes into the area the receiver prepared for it, etc.? These may not be
necessary in the simplest case of messages within the multiple threads
of a single program, but beyond that, I think you need something.

snip

> The base+bounds stuff, aka "capabilities" stuff, is cool - but I think
> is only the last stage in optimizing such a system.
>
>
> Although it might be the first stage if you are thinking like a startup,
> and not Intel.

Well, I was thinking of Intel. With the future looking like more and
more cores per chip/system, it seems to be to Intel's advantage to do
things that make parallel programming easier/more efficient/more bug
free, etc. I proposed this in the, perhaps mistaken, belief that actor
based programs provided a better way forward and that Intel doing things
to encourage that route would be good for Intel.

The idea is that as a first step, it could provide better HW support for
the existing actor based languages. Since Intel also provides
compilers, a further step would be to provide an option for their C (and
perhaps Fortran) compilers to encourage actor based programs by
supporting send/receive directly and at the same time, make it harder
(at least with warnings, perhaps compile time errors) to do the things
in C that would hurt increased parallelism. This would allow people to
leverage their C skills, and parts of existing programs to develop more
scalable applications.

Andy (Super) Glew

Jun 26, 2012, 1:10:09 PM

On 6/25/2012 10:58 PM, Stephen Fuld wrote:
> On 6/21/2012 9:30 PM, Andy (Super) Glew wrote:
>> On 6/21/2012 4:34 PM, MitchAlsup wrote:
>>> On Tuesday, June 19, 2012 4:17:06 PM UTC-5, Stephen Fuld wrote:
>
> snip
>
>
>> This is morphing from the IMHO excessively specific "hardware support
>> for Actors" to the more general "hardware support for message passing".
>
> I have no problem with that, as long as it doesn't get so far away that
> simple actor models are not easily accommodated.

Fair enough.

By the way: I have long been interested in Actors, but I have a terrible
memory for names. I sat next to Carl Hewitt for two meals at a
conference last year, and we talked about message passing - but he never
let drop the term "Actors". He did send me home with a paper of his, at
which point I realized.

>> a) to have message passing instruction set extensions - send/receive
>> (and possibly also PGAS-like, MPI 3.x-like, put, get, which are just
>> shared memory in another form).
>> I.e., yes, I propose instructions
>> put(channel, regval ) / get( channel, regval )
>> put( channel, buf, nbytes ) / get( channel, buf, nbytes )
>> + possibly scatter/gather forms,
>> and/or RDMA forms
>> (and whatever is necessary to match MPI's matching semantics)
>
> I agree that message passing instructions are exactly the kind of thing
> I was looking for. But I think it must be more than the instructions.
> If you want the solution to be all in hardware, don't you need some
> hardware data structures to ensure that someone doesn't flood an
> unprepared process with a huge number of messages, or that the message
> goes into the area the receiver prepared for it, etc.? These may not be
> necessary in the simplest case of messages within the multiple threads
> of a single program, but beyond that, I think you need something.

As you said: not necessary within a single privilege domain. But, yes,
we really do want message passing between privilege domains. Eventually.
The key is coming up with an evolutionary sequence of small incremental
changes, because requiring the whole thing be done at once scares people
off.

Being an architect, I do have a long term agenda that becomes as
complete as I can imagine. But sometimes even sharing that scares people
off.

Anyway

0) - now - we have message passing, e.g. through TCP/IP, and various
supercomputer-type interconnects. These go through software stacks on
the endpoints and in the routers. They have flood control to whatever
degree there is current flood control in TCP/IP, etc.

1) Message passing instructions for same or compatible privilege
domains. Require software to set up, to check compatibility. Once
established, let the application manage its own flow and flood control.

1.1) Message passing when memory is shared... Can use, e.g. virtual
memory tricks.

1.2) Message passing when memory is not shared. Requires smart routers.
But, again, these mostly exist.

2) Message passing instructions between privilege domains. E.g. between
machines using TCP/IP. Basically, interface to the NIC as directly as
possible. Flow/flood control can be done by existing mechanisms in the
NIC and routers. They may need to be augmented for direct access - but
Intel already has direct access to networking, just via shared memory
queues. So, largely, can leverage existing mechanisms.

Largely, my approach is to let software do the hard parts of protection
- managing flow control, preventing DOS. Let hardware do the easy parts
- preventing access out of bounds.

Eventually may want hardware to provide assists to software - e.g.
interrupts when too many packets hit a receiver in a time interval. But
I am guessing that we would be immensely lucky to have gotten to the
point where that is a problem. Or, perhaps, lack of that is what has
been holding architectural message passing back - but I doubt it.
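
For what it is worth, the kind of assist meant here (an interrupt when too many packets hit a receiver in a time interval) is simple to state. A minimal software model, assuming a sliding window and the hypothetical name FloodMonitor, with the return value standing in for "raise the interrupt":

```python
from collections import deque


class FloodMonitor:
    """Flags when more than max_packets arrive within one interval."""

    def __init__(self, max_packets, interval):
        self.max_packets = max_packets
        self.interval = interval
        self.arrivals = deque()       # timestamps still inside the window

    def on_packet(self, now):
        """Record one arrival; True means software should be interrupted."""
        self.arrivals.append(now)
        # Forget arrivals that have aged out of the window.
        while now - self.arrivals[0] >= self.interval:
            self.arrivals.popleft()
        return len(self.arrivals) > self.max_packets
```

Hardware would more likely keep a counter that is cleared periodically than a queue of timestamps; the deque is only there to keep the model exact. What to do once the threshold trips stays with software, in line with the split described above.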

Some asides:

* It is greatly convenient to require software to set up a handle, a
channel, for communication, and then have hardware/firmware operate on
the data in the channel. Unfortunately, MPI's pattern matching
requires better than that - it is impractical to establish N^2 channels
for communication between N nodes. Let alone multicast. What I don't
like is that MPI's pattern matching seems specific.

* Architectural message passing, in the ISA, is squeezed between shared
memory on the one hand, and all of the message passing implementations
that allow direct access to network queues without a syscall.

Stephen Fuld

Jun 26, 2012, 9:39:21 PM

I certainly understand. That is why I held off explaining my goal of
eventually getting C programmers to do better until the second post. :-)

snip good discussion of mechanisms

> Largely, my approach is to let software do the hard parts of protection
> - managing flow control, preventing DOS. Let hardware do the easy parts
> - preventing access out of bounds.

I am not explaining myself well. :-( You seem to be focusing on the
data transfer part of the problem. This is understandable - it is
called message passing after all. :-) But I am talking about the
control transfer part. Let's look at a scenario. Note that this is
based on the actors model as I understand it. I may be wrong here, so
please correct me if appropriate.

A target process issues a receive instruction. It blocks as there is no
message*. A source process issues a send to send a message to the
target process. It goes into the queue for the target process. Now the
target process needs to wake up*. If the source sends another message
before the target has finished processing the first one, the message
goes into the target process's queue. When the target issues its next
receive instruction, it gets the second message. When it finishes
processing that, it issues another receive, and since there are no more
messages in its queue, it goes to sleep.*

The *s represent places where I think there needs to be involvement of
the software, specifically the thread or OS scheduler. This is even
assuming that the instructions know about the queue for the receiver and
can update it atomically. This is in the "normal" operation, not
worrying about DOS attacks, queue overflow, etc., though they too must
be handled.

So my question is, if you want to handle the normal cases purely in
hardware, how do you propose to handle the *ed cases above?
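
To pin down the scenario, here is how it looks when written entirely in software. This is only a sketch using Python threads, with Mailbox, run_actor and demo as hypothetical names; the comments mark where the three *s fall, and in each case it is the thread scheduler, reached through the condition variable, that does the work.

```python
import threading
from collections import deque


class Mailbox:
    """Per-actor message queue with a blocking receive."""

    def __init__(self):
        self._queue = deque()
        self._cond = threading.Condition()

    def send(self, msg):
        with self._cond:
            self._queue.append(msg)   # message goes into the target's queue
            self._cond.notify()       # second *: wake the target if asleep

    def receive(self):
        with self._cond:
            while not self._queue:
                self._cond.wait()     # first and third *: block, go to sleep
            return self._queue.popleft()


def run_actor(mailbox, log):
    """Target process: handle messages until a None sentinel arrives."""
    while True:
        msg = mailbox.receive()
        if msg is None:
            return
        log.append(msg)


def demo():
    mailbox, log = Mailbox(), []
    target = threading.Thread(target=run_actor, args=(mailbox, log))
    target.start()                    # target blocks in receive()
    mailbox.send("first")
    mailbox.send("second")            # may queue behind "first"
    mailbox.send(None)
    target.join()
    return log
```

The queue update itself is the part that looks easy to move into an instruction. The wait() and notify() calls are the part that has to talk to a scheduler, which is the control transfer question being asked.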

> Some asides:
>
> * It is greatly convenient to require software to set up a handle, a
> channel, for communication, and then have hardware/firmware operate on
> the data in the channel. Unfortunately, MPI's pattern matching
> requires better than that - it is impractical to establish N^2 channels
> for communication between N nodes. Let alone multicast. What I don't
> like is that MPI's pattern matching seems specific.

I agree about the handle. I don't know MPI so I can't comment on the rest.

> * Architectural message passing, in the ISA, is squeezed between shared
> memory on the one hand, and all of the message passing implementations
> that allow direct access to network queues without a syscall.

Yes, but that is the data flow again. I think you also have the control
flow stuff to worry about.

MitchAlsup

Jun 27, 2012, 12:42:16 PM

to SF...@alumni.cmu.edu.invalid

On Tuesday, June 26, 2012 8:39:21 PM UTC-5, Stephen Fuld wrote:
> On 6/26/2012 10:10 AM, Andy (Super) Glew wrote:
> I am not explaining myself well. :-( You seem to be focusing on the
> data transfer part of the problem. This is understandable - it is
> called message passing after all. :-) But i am talking about the
> control transfer part. Let's look at a scenario. Note that this is
> based on the actors model as I understand it. I may be wrong here, so
> please correct me if appropriate.
>
> A target process issues a receive instruction. It blocks as there is no
> message*. A source process issues a send to send a message to the
> target process. It goes into the queue for the target process. Now the
> target process needs to wake up*. If the source sends another message
> before the target has finished processing the first one, the message
> goes into the target process's queue. When the target issues its next
> receive instruction, it gets the second message. When it finishes
> processing that, it issues another receive, and since there are no more
> messages in its queue, it goes to sleep.*
>
> The *s represent places where I think there needs to be involvement of
> the software, specifically the thread or OS scheduler. This is even
> assuming that the instructions know about the queue for the receiver and
> can update it atomically. This is in the "normal" operation, not
> worrying about DOS attacks, queue overflow, etc., though they too must
> be handled.

Along with the run-time queue management, there is at least as much OS work going on to set up the translations (page mappings) for the new message, and to tear down the translations from the previous message(s). Knowing that the setup and teardown of the page mappings target a single core's TLB makes shooting it down SO MUCH easier.

Mitch

Marven Lee

Jun 27, 2012, 6:59:52 PM

Stephen Fuld wrote:
> A target process issues a receive instruction. It blocks as there is no
> message*. A source process issues a send to send a message to the
> target process. It goes into the queue for the target process. Now the
> target process needs to wake up*. If the source sends another message
> before the target has finished processing the first one, the message
> goes into the target process's queue. When the target issues its next
> receive instruction, it gets the second message. When it finishes
> processing that, it issues another receive, and since there are no more
> messages in its queue, it goes to sleep.*

If an actor can have many message ports then it could repeatedly poll all
of the ports to see if there are any messages. Alternatively, a select()
type instruction could sleep until there are ports with queued messages.

Also using coroutines to multithread an actor might be useful if a message
cannot be serviced immediately but the actor wants to be able to process
other messages.
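
A software sketch of the select() idea, for illustration only: one actor with several ports, and a call that sleeps until at least one port has something queued. The Actor class and its method names are hypothetical, and Python threading stands in for whatever the instruction would really do.

```python
import threading
from collections import deque


class Actor:
    """An actor with several message ports and a select()-style wait."""

    def __init__(self, nports):
        self._ports = [deque() for _ in range(nports)]
        self._cond = threading.Condition()

    def send(self, port, msg):
        with self._cond:
            self._ports[port].append(msg)
            self._cond.notify()       # wake the actor if it is in select()

    def select(self, timeout=None):
        """Sleep until some port is non-empty; return the ready port numbers.

        Returns an empty list if the timeout expires first.
        """
        with self._cond:
            self._cond.wait_for(lambda: any(self._ports), timeout)
            return [i for i, q in enumerate(self._ports) if q]

    def receive(self, port):
        """Take the oldest message from a port that select() reported."""
        with self._cond:
            return self._ports[port].popleft()
```

Compared with polling every port in a loop, the actor here consumes no cycles while idle, at the cost of the same wake-up problem as a single blocking receive.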

--
Marv