
Loops and parallel execution


Georg Bauhaus

Jan 25, 2011, 10:40:38 AM

A quick idea. Assume that some subprogram Op from package P
is reentrant (and does not depend on global state). Then,

with P;
...
for K in all First .. Last loop
   P.Op (K);
end loop;

should have the effect of permitting the following:

(a) to pick K from First .. Last in any order

(b) to execute P.Op (J) in parallel with P.Op (K) for J, K from
First .. Last

The same would be allowed for sufficiently simple expressions:

for K in all First .. Last loop
   L (K) := Standard."*" (K, 3);
end loop;

Can this be borrowed from HPF (IIUC)?
Is pragma Pure (P) sufficient to signal reentrance?

Dmitry A. Kazakov

Jan 25, 2011, 11:37:03 AM

No, it is not sufficient, because it is wrong. P cannot be pure, because all
instances of P.Op must be synchronized at the end of the "loop." You need
some frame, a context relative to which P might become pure. "Embedded
task", thread, fiber, call it what you want. If you have that at the
language level, then it no longer matters whether the thing executed in such
a context is pure or not. This is similar to proper Ada tasks: you can
access shared data from a task as you wish; if you do so inconsistently,
that is your problem (erroneous execution).

The point is, if you are going to somehow derive concurrency from a
sequentially written program using pragmas and a "mind-reading" compiler, I
doubt that could go anywhere. If you want to add lightweight,
embedded-in-code tasking constructs à la Occam, that might work, but I don't
think they would be very useful. You need to map them onto OS services in
order to gain anything, because normally there is no direct access to the
cores. That is not lightweight. Do you have some particular OS in mind?

This is the thing you wanted, in present-day Ada:

   task type Worker (Do_Me : not null access procedure (K : Integer)) is
      entry Op (K : Integer);
   end Worker;

   task body Worker is
      I : Integer;
   begin
      accept Op (K : Integer) do
         I := K;
      end Op;
      Do_Me (I);   -- runs concurrently with the other workers
   end Worker;

   procedure Print (K : Integer) is
   begin
      Put_Line (Integer'Image (K));   -- assumes Ada.Text_IO is visible
   end Print;
   ...
   declare
      Threads : array (1 .. 20) of Worker (Print'Access);
   begin
      for K in Threads'Range loop
         Threads (K).Op (K);
      end loop;
   end;   -- the block waits here until all workers have terminated

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Georg Bauhaus

Jan 25, 2011, 12:36:29 PM

On 25.01.11 17:37, Dmitry A. Kazakov wrote:
> On Tue, 25 Jan 2011 16:40:38 +0100, Georg Bauhaus wrote:

>> Can this be borrowed from HPF (IIUC)?
>> Is pragma Pure (P) sufficient to signal reentrance?
>
> No, it is not sufficient because it is wrong. P cannot be pure because all
> instances of P.Op must be synchronized at the end of the "loop."

I was thinking of relatively small things, starting from this
observation:

If a compiler detects two independent paths inside a loop,
it will generate sequences of instructions that, while written
sequentially, are executed in parallel by the processor,
because the processor can distribute the corresponding work
across its many registers and execution units. Generalize that.

The Paraffin library just posted by Brad Moore looks like
it will resolve all other issues nicely! :-)

On an AS-IF basis, I thought of a tasking profile much < Ravenscar:
Other than exiting with a value, each P.Op (K) has no need for
communication. Each P.Op (K) is associated with a result object,
like this:

   compiler_type Result_Object is limited record
      --  "compiler_type" is pseudocode for a compiler-generated type
      Ready  : Boolean_Slot;   -- a CAS bit?
      Result : Value_Type;
   end record;

The idea of employing Pure or something similar was to have
the compiler prevent each P.Op (K) from meddling in P.Op (J)'s
affairs, J /= K, as with protected objects, again on an
AS-IF basis, as much as possible.
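
For illustration only (hypothetical names; Value_Type stands for whatever
P.Op returns), such a result slot could be approximated in today's Ada with
an atomic ready flag:

   generic
      type Value_Type is private;
   package Result_Slots is
      type Atomic_Boolean is new Boolean;
      pragma Atomic (Atomic_Boolean);

      type Result_Object is limited record
         Ready  : Atomic_Boolean := False;   -- plays the role of the "CAS bit"
         Result : Value_Type;
      end record;

      procedure Publish (Slot : in out Result_Object; Value : in Value_Type);
      --  Store Value, then make it visible by setting Ready.
   end Result_Slots;

   package body Result_Slots is
      procedure Publish (Slot : in out Result_Object; Value : in Value_Type) is
      begin
         Slot.Result := Value;
         Slot.Ready  := True;
      end Publish;
   end Result_Slots;

The launching loop would hand each P.Op (K) its own slot and test Ready
before reading Result.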

Georg Bauhaus

Jan 25, 2011, 12:38:14 PM

On 25.01.11 18:36, Georg Bauhaus wrote:

> The idea of employing Pure or something similar was to have
> the compiler prevent each P.Op (K) from meddling in P.Op (J)'s
> affairs, J /= K, as with protected objects, again on an
> AS-IF basis, as much as possible.

IOW, be better than OpenMP!

Dmitry A. Kazakov

Jan 25, 2011, 4:32:57 PM

On Tue, 25 Jan 2011 18:36:29 +0100, Georg Bauhaus wrote:

> If a compiler detects two independent paths inside a loop,

I don't see any application for this. Can you remember the last time you
wrote such a loop? I cannot.

Occam's par statement could be a better candidate, but I don't see how
this could be useful under a modern general-purpose OS with its
"vertical" parallelism, where each task is assigned to one core. The thing
you propose is "horizontal" parallelism, where a task/process would run on
all cores simultaneously. Inmos' Occam ran under no true OS, and the
processor architecture was well suited for such ad-hoc parallelism. Modern
processors are very different from the T805, and I doubt that they would
allow an efficient implementation of this.

Georg Bauhaus

Jan 25, 2011, 5:07:01 PM

On 1/25/11 10:32 PM, Dmitry A. Kazakov wrote:
> On Tue, 25 Jan 2011 18:36:29 +0100, Georg Bauhaus wrote:
>
>> If a compiler detects two independent paths inside a loop,
>
> I don't see any application for this. Can you remember the last time you
> wrote such loop? I cannot.

In fact, I have seen such a loop recently; it computes a Mandelbrot
set twice as fast. (I am confident that the lessons learned in
finding this loop have found applications in other loops that
manipulate larger amounts of numeric data.) The author found a way
to split Ada's Complex type into its constituent parts (two
floating-point objects) such that the program is a lot more efficient.
(One would wish that types like Complex were treated specially by the
compiler.)
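
To illustrate the kind of loop meant here (a sketch of the split-components
idea only, with hypothetical names, not the author's actual code):

   --  Escape-time iteration written directly on the two Float components
   --  instead of on Ada.Numerics.Complex_Types.Complex.
   function Escape_Count (C_Re, C_Im : Float; Limit : Positive) return Natural is
      Z_Re, Z_Im : Float := 0.0;
      T          : Float;
   begin
      for N in 1 .. Limit loop
         if Z_Re * Z_Re + Z_Im * Z_Im > 4.0 then
            return N - 1;               -- the point escapes after N - 1 steps
         end if;
         T    := Z_Re * Z_Re - Z_Im * Z_Im + C_Re;
         Z_Im := 2.0 * Z_Re * Z_Im + C_Im;
         Z_Re := T;
      end loop;
      return Limit;
   end Escape_Count;

The two component updates read only the old values of Z_Re and Z_Im, so they
are independent of each other, and the call for one pixel is independent of
the call for any other pixel.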


> The Occam's par-statement could be a better candidate, but I don't see how
> this could be useful under a modern general-purpose OS with their
> "vertical" parallelism, when each task is assigned to one core. The thing
> you propose is "horizontal" parallelism, when a task/process would run on
> all cores simultaneously. Inmos' Occam ran under no true OS, and the
> processor architecture was well suited for such ad-hoc parallelism. Modern
> processors are very different from T805 and I doubt that they would allow
> an efficient implementation of this.


I have recently seen small boards carrying one processor each
that could be connected to one another on all sides, IIRC.
What matters then is, I think, the efficiency of
(a) the distribution of small computations, and
(b) the delivery of results at some nodes.
Is it therefore so unthinkable to have something like a transputer
these days?

BTW, FUD has it that the whole idea (from the early days, I guess)
is nowadays subject to patent lawyering, under names such
as map-reduce...

Yannick Duchêne (Hibou57)

Jan 25, 2011, 8:06:57 PM

On Tue, 25 Jan 2011 22:32:57 +0100, Dmitry A. Kazakov
<mai...@dmitry-kazakov.de> wrote:

> On Tue, 25 Jan 2011 18:36:29 +0100, Georg Bauhaus wrote:
>
>> If a compiler detects two independent paths inside a loop,
>
> I don't see any application for this. Can you remember the last time you
> wrote such loop? I cannot.

I can see one: a kind of compiler optimization. I used to think about
something similar to what Georg describes (except not strictly with loops),
which I called “micro-parallelism”. There are many cases in an application
where some short sequence of instructions, or group of instructions, does
not need to be sequenced. Typically I notice this when I cannot decide which
order to give them to make the source clear, since many orders would be
equivalent. Unfortunately, tasking is inefficient here (too much
overhead). You talked about Occam, which I do not know (just the
principle), but it could be fine, yes.

This kind of parallelism needs to be handled at a low level (the CPU or
something else). It could either be marked explicitly by the author or be
detected by the compiler, as Georg suggested, as a kind of compiler
optimization.
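
For instance (an illustrative sketch only, with hypothetical names): the two
loops below share no state, so their textual order is arbitrary and could in
principle be overlapped without any tasking at all.

   procedure Min_Max_Demo is
      type Sample_Array is array (Positive range <>) of Float;
      Samples   : constant Sample_Array := (3.0, 1.0, 4.0, 1.0, 5.0);
      Min_Value : Float := Samples (Samples'First);
      Max_Value : Float := Samples (Samples'First);
   begin
      for I in Samples'Range loop          -- group 1
         if Samples (I) < Min_Value then
            Min_Value := Samples (I);
         end if;
      end loop;

      for I in Samples'Range loop          -- group 2, independent of group 1
         if Samples (I) > Max_Value then
            Max_Value := Samples (I);
         end if;
      end loop;
   end Min_Max_Demo;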

--
If cats meow and produce so many strange vocalises, it is not for nothing.

“I am fluent in ASCII” [Warren 2010]

Yannick Duchêne (Hibou57)

Jan 25, 2011, 8:31:12 PM

On Tue, 25 Jan 2011 23:07:01 +0100, Georg Bauhaus
<rm-host...@maps.futureapps.de> wrote:

> BTW, FUD places the whole idea (from the early days, I guess)
> subject to patent lawyerly action, nowadays, under names such
> as map-reduce...
What FUD? What patent? Please tell more.

Egil Høvik

Jan 26, 2011, 3:46:27 AM

On Tuesday, January 25, 2011 4:40:38 PM UTC+1, Georg Bauhaus wrote:
> A quick idea. Assume that some subprogram Op from package P
> is reentrant (and does not depend on global state). Then,
>
> with P;
> ...
> for K in all First .. Last loop
> P.Op (K);
> end loop;
>
> should have the effect of the following being permitted:
>
> (a) to pick K from First .. Last in any order
>
> (b) to execute P.Op (J) in parallel with P.Op (K) for J, K from
> First .. Last
>

You should take a look at Tucker Taft's blog about ParaSail:

"ParaSail allows many things to proceed in parallel by default, effectively inserting implicit parallelism everywhere"

http://parasail-programming-language.blogspot.com/2009/09/parasail-language-themes-and-philosophy.html

--
~egilhh

Dmitry A. Kazakov

Jan 26, 2011, 4:04:07 AM

On Tue, 25 Jan 2011 23:07:01 +0100, Georg Bauhaus wrote:

> On 1/25/11 10:32 PM, Dmitry A. Kazakov wrote:

>> The Occam's par-statement could be a better candidate, but I don't see how
>> this could be useful under a modern general-purpose OS with their
>> "vertical" parallelism, when each task is assigned to one core. The thing
>> you propose is "horizontal" parallelism, when a task/process would run on
>> all cores simultaneously. Inmos' Occam ran under no true OS, and the
>> processor architecture was well suited for such ad-hoc parallelism. Modern
>> processors are very different from T805 and I doubt that they would allow
>> an efficient implementation of this.
>
> I have recently seen small boards carrying one processor each
> that could be connected to one another on all sides, IIRC.
> What matters then is, I think, the efficiency of
> (a) distribution of small computation, and
> (b) the delivery of results at some nodes.

The Parix OS (actually a monitor) did that. E.g. if you called, say,
"printf" on a node which didn't have a direct link to the server (the
server was an MS-DOS PC or a Solaris workstation), the output would be
routed to the node connected to the server and from there to the server,
which printed the output.

> Is it therefore so unthinkable to have something like a transputer
> these days?

I saw them too. BTW, they are in some sense a step back compared to the
level Inmos had reached before its fall. They had introduced a programmable
TP-link switch, so that you could reconnect the network of transputers on
the fly.

But the problem is: I really see no use for the par statement or the like.
The main argument against par is that using threads causes too much
overhead. If that argument stands, I mean if you don't have very long code
alternatives running in parallel for seconds, then using a mesh of
processors would only make things worse. The overhead of distributing the
code and data over the mesh of processors is much bigger than on a machine
with shared memory (multi-core). There certainly exist examples of long
independent code alternatives, but I would say that most of them are
contrived or marginal.

Dmitry A. Kazakov

Jan 26, 2011, 5:08:43 AM

On Wed, 26 Jan 2011 02:06:57 +0100, Yannick Duchêne (Hibou57) wrote:

> On Tue, 25 Jan 2011 22:32:57 +0100, Dmitry A. Kazakov
> <mai...@dmitry-kazakov.de> wrote:
>
>> On Tue, 25 Jan 2011 18:36:29 +0100, Georg Bauhaus wrote:
>>
>>> If a compiler detects two independent paths inside a loop,
>>
>> I don't see any application for this. Can you remember the last time you
>> wrote such loop? I cannot.

> I can see one: a kind of compiler optimization. I used to think about
> something similar to what Georg describes (except not strictly with loops),
> which I called “micro-parallelism”.

OK, but you need no special constructs for this. The "mind reading"
compiler could optimize the standard for-loop without further hints.

> This kind of parallelism requires to be handled at low level (CPU or
> else).

If you had a dataflow signal processor...

Georg Bauhaus

Jan 26, 2011, 5:47:00 AM

On 26.01.11 09:46, Egil Høvik wrote:

> You should take a look at Tucker Taft's blog about ParaSail

I have, but I need to look more closely. Is anything known yet
about what John Barnes's Integrate example would look like written
in idiomatic ParaSail?

Peter C. Chapin

Jan 26, 2011, 6:29:02 AM

On Tue, 25 Jan 2011, Georg Bauhaus wrote:

> A quick idea. Assume that some subprogram Op from package P
> is reentrant (and does not depend on global state). Then,
>
> with P;
> ...
> for K in all First .. Last loop
> P.Op (K);
> end loop;
>
> should have the effect of the following being permitted:
>
> (a) to pick K from First .. Last in any order
>
> (b) to execute P (J) in parallel with P (K) for J, K from
> First .. Last

I've often wondered what it would take to support OpenMP (or something like
it) in Ada. The advantage of such an approach is that OpenMP is well
documented and widely used and understood. Right now the OpenMP standard
supports only C, C++, and Fortran. Why not Ada?

Peter

Randy Brukardt

Jan 26, 2011, 4:57:12 PM

"Georg Bauhaus" <rm.dash...@futureapps.de> wrote in message
news:4d3eeef7$0$6879$9b4e...@newsspool2.arcor-online.net...
...

> Can this be borrowed from HPF (IIUC)?
> Is pragma Pure (P) sufficient to signal reentrance?

I've thought about such an idea. But it requires restrictions well beyond
those enforced by pragma Pure. For instance, Pure packages can write through
dereferences of pointers to keep global state. Moreover, there can't be any
"cheating", which is common in pragma Pure packages.

So there would need to be a new kind of categorization for this. I was
hoping that we could use the proposed global in/global out categorizations
to do the job, but those got dropped from Ada 2012.

Also, I think that "no communication" is impractical in most real
applications. But it is sufficient if the communication is tightly limited
(via atomic and protected objects, and/or synchronized interfaces - you'll
need to access global data, just safely). That's another reason why "checked
global in/global out" is needed.
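
A minimal sketch of such tightly limited communication (illustration only,
with hypothetical names): parallel workers touch shared state solely through
one protected object.

   protected Accumulator is
      procedure Add (Partial : in Long_Float);
      function Total return Long_Float;
   private
      Sum : Long_Float := 0.0;
   end Accumulator;

   protected body Accumulator is
      procedure Add (Partial : in Long_Float) is
      begin
         Sum := Sum + Partial;
      end Add;

      function Total return Long_Float is
      begin
         return Sum;
      end Total;
   end Accumulator;

Each worker computes its slice on its own and calls Accumulator.Add exactly
once, so the only synchronization point is the protected action itself.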

Finally, like Dmitry, I'm skeptical about fine-grained parallelism buying
much. Unless there is specific architectural support (something that doesn't
exist in commonly used processors -- and especially in commonly used target
OSes/RTOSes), the management overhead will kill any savings on "small"
expressions. Thread creation is not cheap! The "win" is on larger tasks -
which means that subprograms - and separately compiled subprograms - have to
be involved in some way.

My main interest in this technology is to make it much easier to create
programs that use threads but don't deadlock, livelock, or have dangerous
use of globals. That seems to require restrictions on what you can do, and
definitely requires some form of compile-time checking to enforce those
restrictions. If done usefully, that could be a giant win, as you could use
sequential reasoning for the majority of your programming and debugging, and
still get parallelism when useful.

Randy.

tmo...@acm.org

Jan 27, 2011, 6:01:54 PM

> Finally, like Dmitry, I'm skeptical about fine-grained parallelism buying
> much. Unless there is specific architectural support (something that doesn't
> exist in commonly used processors -- and especially in commonly used target
> OSes/RTOSes), the management overhead will kill any savings on "small"

What about the SIMD (vector) instructions in Intel CPUs? Or is that
better done by simply calling their optimized, CPU-capability-detecting
libraries?

Randy Brukardt

Jan 28, 2011, 7:23:13 PM

<tmo...@acm.org> wrote in message news:ihsth1$igr$1...@speranza.aioe.org...

That's a code generation problem; I don't believe that there is much, if
any, value in the programmer cluttering their code with parallel operations
at that point.

To expand on that a bit: code generation for a CISC machine is primarily a
pattern matching problem. That is, the intermediate code is a list of very
simple pseudo instructions, and the code generator needs to map those to
more complex machine instructions (along with simple ones when the pattern
matching fails). Matching SIMD instructions is a more complex problem than
the simple matcher used in Janus/Ada (to take the example I'm most familiar
with), but it is fundamentally the same problem. In this case, I would
probably apply a loop-unrolling optimization, then a series of pattern
matching operations to create the SIMD instructions.

We already do something like this for aggregates in Janus/Ada. An aggregate
assignment like:

   My_Str := (others => Ch)

can get turned into the Intel STOSB (I think that's the right opcode)
instruction (plus a bit of setup code), which is a lot simpler than the loop
that would otherwise be generated.
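
For illustration only (hypothetical declarations): the aggregate and the
naive loop it stands in for; it is the former that the pattern matcher can
map onto a single block-store instruction.

   declare
      My_Str : String (1 .. 80);
      Ch     : constant Character := ' ';
   begin
      My_Str := (others => Ch);        -- aggregate form

      for I in My_Str'Range loop       -- equivalent explicit loop
         My_Str (I) := Ch;
      end loop;
   end;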

In either case, you'll automatically get the benefit of the advanced
instructions when they can be used, and no code changes are needed. Of
course, if your code doesn't match the pattern, the advanced instructions
wouldn't be used, but it's unlikely that adding a "parallel" directive to
the loop would somehow change that.

I'd be surprised if GCC doesn't already do something like this. (This
particular problem hasn't been on my radar, in part because I didn't even
have a machine that supported most of those instructions until last year.)

Randy.


Paul Colin Gloster

Jan 31, 2011, 8:01:04 AM

In news:op.vpv5dvb1ule2fv@garhos it was mentioned:
> [..] There are many cases in an application where some short sequence of
> instructions or groups of instructions does not need to be sequenced. [..]

Instruction-level parallelism?

Yannick Duchêne (Hibou57)

Feb 6, 2011, 3:06:32 PM

On Mon, 31 Jan 2011 14:01:04 +0100, Paul Colin Gloster
<Colin_Pau...@acm.org> wrote:

Like in graphics cards?

Yannick Duchêne (Hibou57)

Feb 6, 2011, 3:10:49 PM

On Sat, 29 Jan 2011 01:23:13 +0100, Randy Brukardt <ra...@rrsoftware.com>
wrote:

> We already do something like this for aggregates in Janus/Ada. An
> aggregate
> assignment like:
>
> My_Str := (others => Ch)
>
> can get turned into the Intel STOSB (I think that's the right opcode)
Yes, if I remember correctly, this is precisely “rep stosb”, with the
count in the ECX (or CX) register. Stosb alone would just store AL and
update the destination register (DI); that use, although valid, is less
common.

Nicholas Paul Collin Gloster

Feb 7, 2011, 6:43:39 AM

On 2011-02-06, Yannick Duchêne <yannick...@Yahoo.Fr> sent:

>> Instruction-level parallelism?
> Like in graphics cards?

Like in pipelined C.P.U.s: the C.P.U. can fetch one
instruction while an earlier instruction is being executed.

Tuck

Feb 14, 2011, 6:27:29 PM

It might look something like this:

   function Integrate(
         function Func(X : Floating is Float<>) -> Floating;
         Over : Interval<Floating>)
      -> Floating is
      ...
   end function Integrate;

   ...

   const Result := Integrate(Sin, Over => 0.0 .. Pi/2.0);

-Tuck

Georg Bauhaus

Feb 15, 2011, 4:10:05 PM

Thanks, I must have overlooked nested declarations.
(I thought that one would have to pass values explicitly,
so that using an outer function's argument within a function
local to it would have been ruled out. That doesn't seem to be
the case!)

