Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Pedagogical examples of parallelism

4 views
Skip to first unread message

Jon Harrop

unread,
Nov 16, 2009, 12:33:32 PM11/16/09
to

There are many competing high-level solutions for parallel programming. For
example, Cilk-style wait-free work-stealing task deques. The pedagogical
example for this style of parallelism might be a parallel quicksort where
subsorts are performed in parallel on different parts of the array.

What other high-level solutions for parallel programming exist and what
problems are best solved using them? For example, when is nested data
parallelism preferable?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u

blm...@myrealbox.com

unread,
Dec 14, 2009, 3:20:17 AM12/14/09
to
In article <WoKdnWw3wpQCEZzW...@brightview.co.uk>,

Jon Harrop <j...@ffconsultancy.com> wrote:
>
> There are many competing high-level solutions for parallel programming. For
> example, Cilk-style wait-free work-stealing task deques. The pedagogical
> example for this style of parallelism might be a parallel quicksort where
> subsorts are performed in parallel on different parts of the array.
>
> What other high-level solutions for parallel programming exist and what
> problems are best solved using them? For example, when is nested data
> parallelism preferable?
>

A very belated reply ....

You don't mention the approaches I think would be most familiar to
traditional-HPC programmers using MPI and OpenMP, namely splitting
the iterations of a loop over a large array. Are those too simple
for what you have in mind?

--
B. L. Massingill
ObDisclaimer: I don't speak for my employers; they return the favor.

Jon Harrop

unread,
Dec 14, 2009, 1:37:28 PM12/14/09
to

The recursive subdivision of an array in quicksort is close enough to be the
same thing. You can easily solve it with task parallelism.

blm...@myrealbox.com

unread,
Dec 22, 2009, 7:56:11 AM12/22/09
to
In article <QKmdncBm7KwK6bvW...@brightview.co.uk>,

Jon Harrop <j...@ffconsultancy.com> wrote:
> blm...@myrealbox.com wrote:
> > In article <WoKdnWw3wpQCEZzW...@brightview.co.uk>,
> > Jon Harrop <j...@ffconsultancy.com> wrote:
> >> There are many competing high-level solutions for parallel programming.
> >> For example, Cilk-style wait-free work-stealing task deques. The
> >> pedagogical example for this style of parallelism might be a parallel
> >> quicksort where subsorts are performed in parallel on different parts of
> >> the array.
> >>
> >> What other high-level solutions for parallel programming exist and what
> >> problems are best solved using them? For example, when is nested data
> >> parallelism preferable?
> >
> > A very belated reply ....
> >
> > You don't mention the approaches I think would be most familiar to
> > traditional-HPC programmers using MPI and OpenMP, namely splitting
> > the iterations of a loop over a large array. Are those too simple
> > for what you have in mind?
>
> The recursive subdivision of an array in quicksort is close enough to be the
> same thing. You can easily solve it with task parallelism.

You think? Well, probably so, or you wouldn't have said so.
But to me these seem like fairly different things -- splitting
up the iterations of a loop among processes/threads can be done
statically (by which I mean that one can decide before beginning
the loop how to assign iterations to processes/threads), while I
don't think that's the case for quicksort. Hm. Mileage varies
with regard to what counts as "close enough to be the same thing"?
Or am I missing some point ....

Jon Harrop

unread,
Dec 24, 2009, 7:42:18 PM12/24/09
to

I see. You can precompute a parallel strategy with nested loops but not
arbitrary recursion like quicksort as you say. The former is being
reinvented as "nested data parallelism" in the Haskell world today but it
has been the pedagogical solution for parallelizing sparse linear algebra
on supercomputers for decades.

However, a precomputed parallel strategy tends to assume that equal-sized
tasks will take equal time to compute which is often not the case on a
multicore desktop due to OS scheduling, cache effects and so on. So you
really need dynamic load balancing to make efficient use of a modern
machine and that is most easily obtained by using the same divide and
conquer spawning tasks recursively that you use to parallelize solutions
like quicksort.

Moi

unread,
Dec 25, 2009, 8:36:25 AM12/25/09
to
On Mon, 16 Nov 2009 17:33:32 +0000, Jon Harrop wrote:


>
> What other high-level solutions for parallel programming exist and what
> problems are best solved using them? For example, when is nested data
> parallelism preferable?

I have experimented with parallel matrix transposition.
It works by sending a list of pairs-of-tile-numbers to swap and flip
into a pipe. The child processes consume the pairs (read() is atomic)
and do the actual work.

I think this is a natural candidate for parallelism, because the threads
are *known* not to interfere, only to compete for CPU and disk resources.

HTH,
AvK

Patricia Shanahan

unread,
Dec 25, 2009, 9:13:30 AM12/25/09
to

The threads also share cache lines, though not bytes within cache lines,
at the tile boundaries, unless you tune the tiling to the memory layout.

Patricia

Moi

unread,
Dec 25, 2009, 9:39:39 AM12/25/09
to
On Fri, 25 Dec 2009 06:13:30 -0800, Patricia Shanahan wrote:

> Moi wrote:

>
> The threads also share cache lines, though not bytes within cache lines,
> at the tile boundaries, unless you tune the tiling to the memory layout.
>
> Patricia

Yes, you are right, they compete for cache lines, too.
Given the enormous cost of bringing the disk pages into core, I tend to
ignore memory cache. Accesses are page aligned, and a tile typically consist
of (pagesize / sizeof element) pages.
(which is 1024 pages for a 4byte element on a 4k page intel box,
causing a total footprint of 8MB (per 2 tiles), which is bigger than my
L2 cache)


Once all the pages are pulled in, I expect a thread to complete its
flip&swap task in one sweep, so a thread competes only with itself for
cache slots. So it is more or less semi-cache-oblivious ;-)

AvK

0 new messages