APL in 2020 - Parallel processing

Phil Last

Jul 10, 2010, 9:52:32 AM

My first job in computing 30 years ago took me straight into APL. Even
back then there was much talk about how parallel processing was the
future and that APL automatically steals a march on the rest of the
computing world because its structures are inherently parallel.

Thirty years later we hear of tentative steps to implement such things
as "parallel each". As if array operations such as inner- and outer-
product, scan, reduction and each can be anything other than parallel,
at least conceptually.

Should vendors be striving to provide tools for us to make decisions
about how to use our extra processors or should it be implicit with
APL automatically splitting arrays into convenient chunks to be passed
out and processed simultaneously?

Or a third alternative: should the language itself contain structures
similar to those in the parallel language "occam"?

:Parallel
(c0⌿r)←f00 c0⌿r
(c1⌿r)←f01 c1⌿r
(c2⌿r)←f02 c2⌿r
(c3⌿r)←f03 c3⌿r
:EndParallel
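
A minimal sketch, in Python rather than APL, of what the conjectural :Parallel block above might mean: several different functions, each applied to its own disjoint selection of items of r, all started together, with the block completing only when every branch has finished. The functions f00 to f03, the masks and the data are made up for illustration, and the masks are assumed not to overlap. Threads in CPython show the structure but would not by themselves speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_block(r, branches):
    """Run each (mask, fn) branch on its own items of r, all at once.

    Masks are assumed disjoint, so no two branches write the same item.
    """
    def run(mask, fn):
        idx = [i for i, keep in enumerate(mask) if keep]   # items picked by the mask
        return idx, fn([r[i] for i in idx])

    result = list(r)
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(run, mask, fn) for mask, fn in branches]
        for fut in futures:                 # wait for every branch, like :EndParallel
            idx, new_items = fut.result()
            for i, item in zip(idx, new_items):
                result[i] = item
    return result

# Hypothetical stand-ins for f00 .. f03
f00 = lambda items: [x * 10 for x in items]
f01 = lambda items: [x + 1 for x in items]
f02 = lambda items: [-x for x in items]
f03 = lambda items: [0 for _ in items]

r = [1, 2, 3, 4, 5, 6, 7, 8]
c0 = [1, 1, 0, 0, 0, 0, 0, 0]
c1 = [0, 0, 1, 1, 0, 0, 0, 0]
c2 = [0, 0, 0, 0, 1, 1, 0, 0]
c3 = [0, 0, 0, 0, 0, 0, 1, 1]
out = parallel_block(r, [(c0, f00), (c1, f01), (c2, f02), (c3, f03)])
```

Collecting each branch's result and writing it back in the calling thread keeps the branches free of shared state, which is what makes it safe to start them together.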

Richard

Jul 11, 2010, 3:57:19 AM

There has been discussion of this topic in the J forum.
The net is that it is not as simple as it appears...
In real-world APL/J apps the percentage of pure vector ops isn't
as high as many people think, so the benefits may be marginal in any
case.
I ask whether the interpreters are getting as much out of current
chips as they could. E.g. do they use the SSE instructions
and vectorise the code?
Are the SSE instructions unusable by our interpreters because the
instructions are 32-bit FP only?
I think later chips have better handling of 64 bit FP.
To exploit multicore you have a lot of overhead in building and
despatching separate threads.
The latest APL+Win (V10) includes support at the app level
for multithreading so users can build parallelism if they wish.

Joe_Blaze

Jul 12, 2010, 12:59:16 AM

... The latest APL+Win (V10) includes support at the app level for
multithreading so users can build parallelism if they wish...

The APL+Win V10 enhancement for multi-threading is an enhancement to
APL+Win which can also be used by any .Net language to manage multiple
parallel threads of APL+Win.

No such 'enhancement' is necessary for VisualAPL because as a .Net
programming language, it inherently supports multi-threading using the
System.Threading namespace of the .Net Framework.

Both APL+Win V10 and VisualAPL adopt programmer-controlled multi-
threading to achieve asynchronous, parallel processing.

Dick Bowman

Jul 12, 2010, 2:05:45 AM

On Sat, 10 Jul 2010 06:52:32 -0700 (PDT), Phil Last wrote:

> [... deleted ...]

>
> Thirty years later we hear of tentative steps to implement such things
> as "parallel each". As if array operations such as inner- and outer-
> product, scan, reduction and each can be anything other than parallel,
> at least conceptually.
>

> [... deleted ...]

I am also a little puzzled about "parallel each" - and have some vague
recollection of discussions where some people insisted on knowing the order
in which "each" (plain vanilla) operated. Personally I've always taken the
view that array operations should work on whole arrays and that I should
not exploit knowing how one interpreter might work its way through
element-by-element (partly so that I can move my code to another interpreter
and it still works the same way, and mostly so that I don't go bleating to
the vendor when they make a globally beneficial change that breaks my
"special understanding"). If things need to happen in a special order, I
have the option to loop.

The question raised lower down by Richard Hill reminds me of the study
(thought it was APL82, but can't find the exact source) where somebody
counted array sizes in typical applications and came up with quite low
figures. Has anyone repeated this more recently or is the evidence
anecdotal - things may have changed? Certainly Dyalog has included the
tools needed to do this.

As for multi-threading - my practice has been to use multithreading to let
the application "do different things at the same time" rather than work in
parallel over an array - for example to let it go off and scrape some data
out of Web pages while the user analyses previously-saved data.

Something I've seen mentioned away from the APL context is using graphic
processor chips to do the number-crunching - because the hardware there is
multimultimulticore and there really are huge possibilities for parallel
processing. If we're working on really big whole arrays that might be
something with a future (although I worry about lock-in to specific
hardware).

David Liebtag

Jul 12, 2010, 8:34:56 AM

APL2 has supported parallel each since 2008. Two flavors are available:
PEACHT which uses multiple threads and is appropriate for use on machines
with multiple cores and PEACHP which uses multiple processes and is
appropriate for distributing applications across multiple machines. Both
PEACHT and PEACHP require that the function to be applied reside in a
namespace and have no side effects (the function has no access to the
calling namescope).
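
A rough analogue in Python, offered only to illustrate the idea of a parallel each over a side-effect-free function. It is not APL2's interface: peach_t and row_total are hypothetical names, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def peach_t(fn, items, workers=4):
    """Apply a side-effect-free fn to each item on a pool of threads.

    Results come back in argument order, whatever order the items were
    actually processed in.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def row_total(row):
    # A pure function: depends only on its argument, touches nothing outside.
    return sum(row)

totals = peach_t(row_total, [[1, 2, 3], [4, 5], [6]])
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor would give a multi-process flavour; in that case the function has to be importable by the worker processes, a constraint loosely comparable to requiring that it live in a namespace.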

PEACHT and PEACHP are documented in the APL2 User's Guide which is available
for download here:

http://www.ibm.com/software/awdtools/apl/library.html

David Liebtag
IBM APL Products and Services

Michael Jenkins

Jul 12, 2010, 10:39:47 AM

Two comments:
On use of graphics chips for fast parallel computation.
The number of APL operations that can benefit from this is small, but
very important.
Maybe a goal for 2020 would be to build libraries of support routines
for this small subclass of operations and share them among all the
vendors/implementers. The library could have versions for each widely
distributed graphics chip and the cover routines could auto-select
which version to use based on available hardware. By sharing the work
of creating these very specialized routines the whole community would
benefit.

Parallel Each:
One possible use of parallel EACH is for problems that are basically
tree structured, but require a substantial computation at each
abstract leaf node. Such problems can be represented by nested arrays
in a direct manner, but the direct representation reduces the
opportunities for parallelism. Another approach is to encode the tree
nesting structure separately from the abstract leaf nodes, which can
be kept in a large array. Then the computation needed for each
abstract leaf can be directly executed in parallel with a Parallel
each operation. An example of this type of problem is the N-body
problem.
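
A minimal Python sketch of the second approach: the nesting structure is recorded separately as a list of paths, the leaves are kept in one flat list, and the expensive per-leaf function is applied to that flat list in parallel before the tree is rebuilt. The function names and the heavy stand-in computation are illustrative assumptions, and the sketch assumes there are no empty sub-lists.

```python
from concurrent.futures import ThreadPoolExecutor

def flatten(tree, path=()):
    """Return (paths, leaves): where each leaf sits, and the leaves themselves."""
    if not isinstance(tree, list):
        return [path], [tree]
    paths, leaves = [], []
    for i, child in enumerate(tree):
        p, l = flatten(child, path + (i,))
        paths += p
        leaves += l
    return paths, leaves

def rebuild(paths, leaves):
    """Inverse of flatten: put each leaf back at its recorded path."""
    if paths == [()]:
        return leaves[0]
    root = []
    for path, leaf in zip(paths, leaves):
        node = root
        for depth, i in enumerate(path):
            last = depth == len(path) - 1
            if i == len(node):               # first visit to this position
                node.append(leaf if last else [])
            if not last:
                node = node[i]
    return root

def heavy(leaf):
    # Stand-in for the substantial computation needed at each leaf.
    return leaf * leaf

tree = [1, [2, 3], [[4], 5]]
paths, leaves = flatten(tree)
with ThreadPoolExecutor() as pool:
    new_leaves = list(pool.map(heavy, leaves))   # one flat, parallel "each"
result = rebuild(paths, new_leaves)
```

Because the leaves sit in one flat list, the work can be divided evenly among workers regardless of how lopsided the tree is.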

Bjorn

Jul 12, 2010, 11:36:56 AM

I noticed the migration guide ( from VS APL to APL2 ) among the
publications

http://publibfp.boulder.ibm.com/epubs/pdf/h2110691.pdf

Do you have any idea what percentage of the original VS APL
installations have still not migrated, 30 years on?

Do you still do any work on VS APL?

Bjorn

Jul 12, 2010, 12:04:11 PM

I like to think about parallel computing as an option of one
application controlling many applications.

Sending tasks to be processed and monitoring progress.

The tasks could be on the same machine or different machines.

The controlling application would have a task manager to report what
is going on in different tasks and eventually allow tasks to be stopped
and interactions and dependencies between tasks to be set.

This opens up a whole array of possibilities and options.

Over the years I have created various applications using files to
control progress of tasks and interaction between machines.

It can be very messy and complicated and a well defined way of doing
this would surely be interesting.

It is of course also interesting to mix in processes in various
languages.
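
One way such a controller could look, as a minimal single-machine Python sketch: tasks are named, each may depend on others, and the controller starts a task only once its prerequisites have reported back. The run_tasks function and the example task names are hypothetical, and distributing tasks across machines is not attempted here.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks, deps, workers=4):
    """Run named tasks, starting each once all its prerequisites are done.

    tasks: name -> function taking a dict of its prerequisites' results
    deps:  name -> list of prerequisite names
    Returns name -> result; raises ValueError on a circular or missing
    dependency.
    """
    results, pending = {}, dict(tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            ready = [n for n in pending
                     if all(d in results for d in deps.get(n, []))]
            if not ready:
                raise ValueError("circular or missing dependency")
            futures = {n: pool.submit(pending.pop(n),
                                      {d: results[d] for d in deps.get(n, [])})
                       for n in ready}
            for name, fut in futures.items():   # the "monitoring" step
                results[name] = fut.result()
    return results

# Hypothetical tasks: two independent loads, then a merge, then a report.
tasks = {
    "load_a": lambda got: [1, 2, 3],
    "load_b": lambda got: [10, 20],
    "merge":  lambda got: got["load_a"] + got["load_b"],
    "report": lambda got: sum(got["merge"]),
}
deps = {"merge": ["load_a", "load_b"], "report": ["merge"]}
results = run_tasks(tasks, deps)
```

Here load_a and load_b run together in the first wave, merge in the second and report in the third.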

David Liebtag

Jul 12, 2010, 6:26:17 PM

Bjorn,

We haven't made any changes to VS APL in years.

Phil Last

Jul 13, 2010, 6:25:59 AM

I agree with Dick regarding the danger and undesirability of relying
on particular knowledge of implementation of array operations which
are conceptually parallel. Thus almost all array operators could be
made actually parallel without changing a line of code.

Richard points out that automatically using parallel architecture "is
not as simple as it appears" and that "benefits may be marginal".

I should have thought the cost-benefit analysis could easily be coded
into the interpreter using current hardware configuration as input.
Multi-dimensional APL arrays naturally split into cells that could be
directed to different processors if an overall benefit was deemed to
be likely.

Thus it seems to me that a single process applied to large data arrays
could be made beneficially to use the hardware without the code having
to request it.
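
To make the cost-benefit idea concrete, here is a deliberately crude Python sketch of the kind of test an interpreter might apply before splitting an array operation. The cost model and the worth_splitting name are assumptions for illustration only; a real interpreter would need measured figures for the machine it is running on.

```python
import os

def worth_splitting(n_items, cost_per_item, dispatch_cost, cores=None):
    """Estimate whether splitting n_items across cores beats running serially.

    All costs are in the same arbitrary time unit and are assumptions the
    interpreter would have to measure or guess for the current machine.
    """
    cores = cores or os.cpu_count() or 1
    if cores < 2:
        return False
    serial = n_items * cost_per_item
    parallel = serial / cores + cores * dispatch_cost
    return parallel < serial

small = worth_splitting(10, 1, 1000, cores=4)            # overhead dominates
large = worth_splitting(10_000_000, 1, 1000, cores=4)    # work dominates
```

With these made-up figures the ten-item case stays serial and the ten-million-item case is split.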

Starting a number (other than two) of different processes in parallel
is a different matter entirely. APL currently doesn't have the syntax
to do it (but see counter-example below). Various ways of bifurcating
a process (I believe it was B-tasks in Sharp APL; Spawn in Dyalog;
no doubt there are others) permit one APL process to start and monitor
another. Combined with the "each" operator many instances of the other
task can be started together. None of this is new. But neither does
any of it permit me to start more than one DIFFERENT task at the same
time. We can imagine a dyadic operator that runs both its operands
simultaneously but even that would still only be a bifurcation.

The conjectural control structure at the top of this thread would be
one way.

A very contrived counter-example uses the dotted syntax of Dyalog.
Each different task is assigned to a different namespace but with the
same name. One way to do this would be to create a class for each task
with a homonymous method (eg. task). With all the namespaces (one
instance of each class) in an array (eg. nss) we can then run:

nss.task data

which will distribute the data scalarwise to the tasks.
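
The same shape in Python, as a minimal sketch: objects of different classes that share a method name, paired item by item with the data. The three classes are invented for illustration. Here a thread pool makes the calls concurrent; nothing is claimed about how Dyalog itself schedules the dotted call.

```python
from concurrent.futures import ThreadPoolExecutor

# Three different tasks, each a class with a method of the same name.
class Doubler:
    def task(self, x):
        return 2 * x

class Negator:
    def task(self, x):
        return -x

class Describer:
    def task(self, x):
        return f"got {x}"

nss = [Doubler(), Negator(), Describer()]   # one instance of each class
data = [10, 20, 30]                         # one item per namespace

with ThreadPoolExecutor(max_workers=len(nss)) as pool:
    # Pair the i-th object with the i-th item and call its 'task' method.
    results = list(pool.map(lambda ns, x: ns.task(x), nss, data))
```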

Bjorn

Jul 13, 2010, 6:31:10 AM

I am sure a lot of people are still using it - unchanged - after all
these years.

Just goes to prove how much of a quality product it is.

I bet most of the users are also using ADI, ADRS and such quality
applications.

Ibeam2000

Jul 13, 2010, 8:54:15 AM

> On use of graphics chips for fast parallel computation.
> The number of APL operations that can benefit from this is small, but
> very important.

The last time I looked at this, graphics boards typically used single
precision floating point arithmetic: at the end of the graphics
pipeline, every result was eventually going to be rounded to the
integer coordinate system of the display, so reduced precision was OK.
Single precision may be good enough for casual use, but may produce
incorrect results for something like a matrix inversion. Also, the
floating point employed in some of the graphics boards is not exactly
the same as the IEEE 754 standard used in standard CPUs. The point of
having an onboard array floating point processor was to speed up the
rendering process and shortcuts (which translate to slightly different
answers) are or were the norm.
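
A small Python illustration of how single precision can silently lose information. It uses a running sum rather than a matrix inversion because the effect is easier to see; the helper names are made up, and single precision is simulated by rounding through the struct module's 4-byte float format.

```python
import struct

def to_single(x):
    """Round a Python float (IEEE 754 double) to the nearest single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_single(values):
    """Accumulate with the running total rounded to single after each add."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total

values = [16777216.0] + [1.0] * 100    # 2**24 followed by a hundred ones
double_sum = sum(values)               # 16777316.0
single_sum = sum_single(values)        # stays at 16777216.0
```

A single has a 24-bit significand, so above 2**24 adjacent representable values are 2 apart and each added 1.0 is rounded away.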

AAsk

Jul 13, 2010, 4:18:55 PM

The semantics for threading need to be 1. simple 2. under the control
of the developer at his/her option.

In C# AND Visual APL:

1. Simplest method: call a function inside a thread: it runs
asynchronously & the thread tidies up on completion

// Run function 'Niladic' asynchronously
new System.Threading.Thread(Niladic).Start();

2. For dyadic (or polyadic functions):

new System.Threading.Thread(() => Dyadic("ME ", 150)).Start();

3. For multiple functions, sequentially, inside the same thread:

System.Threading.Thread t2 = new System.Threading.Thread(() =>
{
    OneOfTwo("Ajay");
    TwoOfTwo("Askoolum");
});
t2.Start();

In the third example, the instance of the thread (namely, t2) may
remain available to other processes for interrogation.

I haven't tried using the APLNextSupervisor yet.

Joe_Blaze

Jul 13, 2010, 5:28:31 PM

... without changing a line of code...
The notion that significant improvements in performance can be
achieved without the application system programmer making any effort
has been a fondly remembered fixture of IT since even before Moore's Law
was first enunciated. Now however, significant increases in component
density, clock speed and reduction in connection widths have not been
readily forthcoming from the hardware suppliers. Whereas these
previous hardware innovations have produced almost automatic
performance increases, the manifestations of multi-core and multi-
processor hardware merely provide the programmer with the opportunity
to re-configure their applications for asynchronous processing with a
potential for performance improvements. Performance improvements due
to the incorporation of asynchronous processing in a specific
application system are achieved only at the extremes. The extremes
intersect when many executions of an in-memory calculation intensive
method occur in an application system. Typically these are time series
analysis, simulations of weather or chemical reactions, group theory
or population dynamics. With multi-core processors, the application
system programmer will need to significantly re-configure only the
appropriate application systems and do that re-configuration
appropriately.

... automatically using parallel architecture ... "benefits may be
marginal"...
In most cases symmetric multi-processing is all that can be made
available to the application system programmer in a programming
language as general as APL. Because of data marshalling costs between
the controlling application and the threads, SMP yields benefits only
in extreme cases of highly-repetitive, in-memory algorithms without
'side effects' that process very large arrays.

... the cost-benefit analysis could easily be coded into the
interpreter...
The performance cost to the overall application of such operator-level
analysis by the interpreter is not insignificant. Adding interpreter
analysis to language-level operators to detect the relatively rare
conditions where multi-processing would yield benefits burdens the
performance for the vast majority of operator executions. For general
use programming languages such as APL, application system programmer
intervention to select the areas where asynchronous processing is
beneficial is the model used by Microsoft for C#, APLNext for
VisualAPL and the APL+Win Supervisor and IBM APL2 with 'peach'.

aleph0

Jul 14, 2010, 6:01:20 AM

A Parallel Machine purely for APL was built in 1989 in Saarbrücken by
(now Dr.) Jürgen Sauermann, under Prof. Paul's direction.

http://genealogy.math.ndsu.nodak.edu/id.php?id=63307

The machine and the software worked as described in Jürgen's
dissertation.

Some comments I can remember worth thinking about :

+ the machine "effectively" had no Operating System
i.e. APL talked directly to the hardware

An OS would effectively "just get in the way" of liaising the
requirements of the APL interpreter with the hardware.

+ Workspaces were loaded into memory "in parallel" from a disk array.

Minister La Fontaine visited the project out of special interest ( he
has a Physics degree IIRC )
AFAIK, Deutsche Telekom used the machine on the Hanover Fair to
demonstrate their ( yet to come ) Internet backbone ( Glass Fibre
IIRC ).

Maybe if Jürgen is here somewhere, he could elaborate ?

Ibeam2000

Jul 17, 2010, 1:55:28 AM

Nearly all CPUs today feature some parallel processing capabilities.
Intel introduced Streaming SIMD Extensions (SSE) with the Pentium III
processors in 1999. This has been enhanced over the years.
Essentially, the array processor, as such, is a 128-bit register which
can be partitioned as required: 2 64-bit numbers, 4 32-bit numbers, 8
16-bit numbers, or 16 8-bit numbers. Later versions went up to 256
bits.
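
The partitioning can be pictured with a few lines of Python, using the struct module to view the same 16 bytes at each lane width. This only models the data layout; the add_lanes_16 helper is a made-up name whose loop stands in for what a single packed-add instruction would do to all eight lanes at once.

```python
import struct

register = bytes(range(16))            # 16 bytes = 128 bits of arbitrary data

lanes_64 = struct.unpack("<2Q", register)    # 2 lanes of 64 bits
lanes_32 = struct.unpack("<4I", register)    # 4 lanes of 32 bits
lanes_16 = struct.unpack("<8H", register)    # 8 lanes of 16 bits
lanes_8 = struct.unpack("<16B", register)    # 16 lanes of 8 bits

def add_lanes_16(a, b):
    """Add two 128-bit values as eight independent 16-bit lanes (wrapping)."""
    xs = struct.unpack("<8H", a)
    ys = struct.unpack("<8H", b)
    return struct.pack("<8H", *((x + y) & 0xFFFF for x, y in zip(xs, ys)))

a = struct.pack("<8H", 65535, 1, 2, 3, 4, 5, 6, 7)
b = struct.pack("<8H", 1, 1, 1, 1, 1, 1, 1, 1)
total = struct.unpack("<8H", add_lanes_16(a, b))   # first lane wraps to 0
```

Each lane overflows independently: the carry out of the first lane is discarded rather than spilling into its neighbour.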

See http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

Whether or not using these extended features would be significantly
faster is one thing, though some of the instruction variants could
potentially benefit APL operations which involve vector additions of
small (8- or 16-bit) integers. As a practical matter, using these
extended instructions might complicate the software delivery problem
as a vendor may have to support several CPU types with several
versions of the same executable.

IBM had some mainframe vector processing extensions which showed up in
the late 1980s which were supported by both APL2 and Fortran. If I
remember correctly, some of these instructions worked with 128 double
precision floating point numbers.

Curtis A. Jones

Jul 20, 2010, 5:15:04 PM

See David Patterson, "The Trouble With Multi-Core", IEEE Spectrum Vol.
47, No. 7, pp. 28 ff (July 2010)
http://spectrum.ieee.org/computing/software/the-trouble-with-multicore

Patterson argues that since we can't run processors much faster and
get rid of the heat, industry is building computers with more
processors, but what to do with them is still uncertain. In a
sidebar: "There's no lack of languages designed to support parallel
processing -- this is just a selection of them. Still, they don't
make parallel processing easy or straightforward." The list has 50
languages, including Ada, APL, LabView, Modula-3, SISAL and ZPL. No
A, J, or K! In the body of the article: "One early vision was that
the right computer language would make parallel programming
straightforward. There have been hundreds - if not thousands - of
attempts at developing such languages, including such long-gone [!!!!]
examples as APL, Id, Linda, Occam, and SISAL. Some made parallel
programming easier, but none has made it fast, efficient, and flexible
as traditional sequential programming. Nor has any become as popular
as the languages invented primarily for sequential programming."

cjl

Jul 23, 2010, 8:49:16 AM

Is anyone working on a concrete parallel processing problem? I'm
looking for a specific real-world example.

P.S. Is Gary Berquist watching this forum? I think he's the most
recent APL shared libraries advocate. Or at least he talked about it
at the APL2000 conference in April 2010.

Peter Keller

Jul 23, 2010, 11:36:21 AM

Phil Last <phil...@ntlworld.com> wrote:
> Thirty years later we hear of tentative steps to implement such things
> as "parallel each". As if array operations such as inner- and outer-
> product, scan, reduction and each can be anything other than parallel,
> at least conceptually.

I happened to stumble across this paper while looking for something else
compiler-related and it seemed somewhat germane to the discussion:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2814
also here
http://www.snakeisland.com/techrpt.pdf

Have fun.

Later,
-pete

Richard Nabavi

Jul 23, 2010, 1:39:16 PM

On 10 July, 14:52, Phil Last <phil.l...@ntlworld.com> wrote:
>
> Should vendors be striving to provide tools for us to make decisions
> about how to use our extra processors or should it be implicit with
> APL automatically splitting arrays into convenient chunks to be passed
> out and processed simultaneously?
>

Ideally, you don't want to embed any detailed knowledge of the number
of extra processors or the trade-offs into your code, because you
don't want to have to re-tune if the software is run on a different
configuration.

Modern parallel-processing libraries/toolkits such as OpenMP (for C code)
and Intel Threading Building Blocks (for C++ code) try to
address this problem by automatically estimating the trade-off at
runtime. You don't need to change your application to get the
advantage of going from 1 to 2 to 4 or up to dozens of processor
cores.

So probably what we should consider is:

1) Automatic parallelisation of intrinsic functions or sequences of
APL operations, where this is guaranteed to be safe (i.e. without side
effects). For example, the interpreter can seamlessly do this for
operations on large vectors; it could split up +/V into 2, 4, 8, ...
sub-reductions (depending on the number of processor cores available)
and then add up the intermediate results at the end. Of course,
whether this is worthwhile depends on the overhead of firing off and
coordinating the separate sub-calculations.

At present, the opportunities for doing this are fairly limited, and
the benefit in terms of performance not that great overall (the
development effort for the APL interpreter writers would probably be
better spent on improving algorithms elsewhere in the interpreter).
But that could change; at present, we are typically looking at
processors with 4 or 8 cores. What if processors with 128 or 256
cores become commonplace?

2) Some syntax in APL to allow the programmer, with knowledge of what
is really going on, to say: 'This chunk of work can be executed in
parallel, if there is an advantage in doing so'. Parallel Each is
effectively one example of this approach. But it is important to let
the low-level software, which has knowledge of the hardware at runtime
(and possibly knowledge of what other tasks are running) take the
decision as to whether or not to actually execute in parallel.
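
Both points can be sketched together in a few lines of Python: a +/V that is split into sub-reductions only when the runtime judges it worthwhile, with the caller's code unchanged either way. The plus_reduce name and the min_chunk break-even size are assumptions for illustration, not measured values, and CPython threads show the structure rather than a real speed-up.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def plus_reduce(v, min_chunk=100_000, cores=None):
    """Sum v, splitting into per-core sub-reductions only when v is large.

    min_chunk is an assumed break-even size, not a measured one.
    """
    cores = cores or os.cpu_count() or 1
    n_chunks = min(cores, len(v) // min_chunk)
    if n_chunks < 2:
        return sum(v)                       # not worth the coordination cost
    size = -(-len(v) // n_chunks)           # ceiling division
    chunks = [v[i:i + size] for i in range(0, len(v), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)                    # add up the intermediate results

v = list(range(1_000_000))
big = plus_reduce(v, cores=4)       # split into four sub-reductions
tiny = plus_reduce([1, 2, 3])       # stays sequential
```

For floating-point data the regrouping can change the rounding of the result slightly, since floating-point addition is not associative; that is one reason an interpreter might treat such a split as a visible change rather than a pure optimisation.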

Richard Nabavi
MicroAPL