What other high-level solutions for parallel programming exist and what
problems are best solved using them? For example, when is nested data
parallelism preferable?
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u
A very belated reply ....
You don't mention the approaches I think would be most familiar to
traditional-HPC programmers using MPI and OpenMP, namely splitting
the iterations of a loop over a large array. Are those too simple
for what you have in mind?
--
B. L. Massingill
ObDisclaimer: I don't speak for my employers; they return the favor.
The recursive subdivision of an array in quicksort is close enough to be the
same thing. You can easily solve it with task parallelism.
You think? Well, probably so, or you wouldn't have said so.
But to me these seem like fairly different things -- splitting
up the iterations of a loop among processes/threads can be done
statically (by which I mean that one can decide before beginning
the loop how to assign iterations to processes/threads), while I
don't think that's the case for quicksort. Hm. Mileage varies
with regard to what counts as "close enough to be the same thing"?
Or am I missing some point ....
I see. You can precompute a parallel strategy with nested loops but not
arbitrary recursion like quicksort as you say. The former is being
reinvented as "nested data parallelism" in the Haskell world today but it
has been the pedagogical solution for parallelizing sparse linear algebra
on supercomputers for decades.
However, a precomputed parallel strategy tends to assume that equal-sized
tasks will take equal time to compute which is often not the case on a
multicore desktop due to OS scheduling, cache effects and so on. So you
really need dynamic load balancing to make efficient use of a modern
machine and that is most easily obtained by using the same divide and
conquer spawning tasks recursively that you use to parallelize solutions
like quicksort.
>
> What other high-level solutions for parallel programming exist and what
> problems are best solved using them? For example, when is nested data
> parallelism preferable?
I have experimented with parallel matrix transposition.
It works by sending a list of pairs-of-tile-numbers to swap and flip
into a pipe. The child processes consume the pairs (read() is atomic)
and do the actual work.
I think this is a natural candidate for parallelism, because the threads
are *known* not to interfere, only to compete for CPU and disk resources.
HTH,
AvK
The threads also share cache lines, though not bytes within cache lines,
at the tile boundaries, unless you tune the tiling to the memory layout.
Patricia
> Moi wrote:
>
> The threads also share cache lines, though not bytes within cache lines,
> at the tile boundaries, unless you tune the tiling to the memory layout.
>
> Patricia
Yes, you are right, they compete for cache lines, too.
Given the enormous cost of bringing the disk pages into core, I tend to
ignore memory cache. Accesses are page aligned, and a tile typically consist
of (pagesize / sizeof element) pages.
(which is 1024 pages for a 4byte element on a 4k page intel box,
causing a total footprint of 8MB (per 2 tiles), which is bigger than my
L2 cache)
Once all the pages are pulled in, I expect a thread to complete its
flip&swap task in one sweep, so a thread competes only with itself for
cache slots. So it is more or less semi-cache-oblivious ;-)
AvK