
parallelizing tree-like runtime code paths


Hicham Mouline
Sep 6, 2010, 2:57:30 AM
hello,

I have sequential code that executes the following pseudo-code.
There are 'm' parameters, where m is a runtime value, not a compile-time
constant. Each of the 'm' parameters can take a different set of values,
i.e.

parameter i1 can take values i1_1 .... i1_n1
parameter i2 can take values i2_1 .... i2_n2
...
parameter im can take values im_1 .... im_n_m

where everything is input at runtime; nothing is known at compile time.
n1, n2, ..., n_m are most likely different, and some of them may be just 1.

For each possible m-tuple, we run a function that takes the m-tuple and
produces an output, which we store in an m-dimensional array.

If m=5, the code will execute the equivalent of:

for (i1)
  for (i2)
    for (i3)
      for (i4)
        for (i5)
          results_matrix_5_dims[i1]...[i5] = some_function(i1, ..., i5)

Obviously, these aren't actual for loops. Rather, it is a recursive
function that looks like this (still pseudo-code, but much closer to the
real code):

for_recurse( int level, ... )
{
    // if level==0, exit
    for (....)   // a particular level
        for_recurse(level-1, ...)
}

// start
for_recurse( m, ... )
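
For concreteness, a minimal self-contained C sketch of this recursion
(the names "m", "count", "idx" and the counting-up direction are
illustrative, not the real code):

static unsigned m;       /* number of parameters, known only at runtime */
static unsigned *count;  /* count[j]: how many values parameter j can take */

static void some_function(const unsigned *idx); /* consumes one m-tuple */

static void
for_recurse(unsigned level, unsigned *idx)
{
    unsigned i;

    if (level == m) {                  /* all m indices chosen */
        some_function(idx);            /* store into the result matrix */
        return;
    }
    for (i = 0u; i < count[level]; ++i) {
        idx[level] = i;                /* pick the i-th value at this level */
        for_recurse(level + 1u, idx);
    }
}

/* start: for_recurse(0u, idx), with "idx" an array of m unsigned's */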

The question is: how do I parallelize the above?
The number of runs of 'some_function' is n1*n2*...*n_m.
Ideally, if there are x cores in one machine, then in a very simplistic
model (assuming the system is completely idle besides this program, and
that no threads need to synchronize their writes to the result matrix,
because the index into the matrix is always different in different
threads), there would be x threads. Each thread would call
'some_function' n1*n2*...*n_m/x times. some_function is thread-safe.
But how do I divide up the work?

regards,


Ersek, Laszlo
Sep 6, 2010, 7:44:17 PM

One idea is to be explicit about the index space. Let's first return to
the iterative notation (the nested loops). There will be no need to
implement it, just imagine it.

What you actually do is traverse the integers in [0, n1*n2*...*n_m) --
left-inclusive, right-exclusive. Let's call this moving integer value X a
"flat" index. The broken-down m-tuple (the index vector) that
some_function() needs to work on is actually a *mixed-base representation*
of X, where the "digits" -- or more exactly, the number positions --
instead of all using the same base (like 2 in binary, 10 in decimal, 16
in hexadecimal), use the bases n1, n2, ..., n_m.
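
For example, with m=3 and bases (n1, n2, n3) = (2, 3, 4), the flat index
X=13 breaks down as 13 = 1*(3*4) + 0*4 + 1*1, that is, the index vector
(1, 0, 1).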

If you observe X's domain (which has a single dimension), you can easily
split it into T partitions by simple integer division (where T is the
number of threads/cores/logical processors/whatever). Then you need three
operations:

1) convert any given X in said range to its mixed-base representation, so
that you can "jump" to the starting index vector of a partition,
2) increment a mixed-base representation by one, maintaining carries, so
that you can advance within a partition,
3) compare two mixed-base representations for equality ("less-than" isn't
hard either, it works lexicographically), so that you know when you've
reached the end of the partition.

Here's a sample implementation. I compiled and lightly tested most parts
of it. You can use it verbatim if you like it, I hereby place it in the
public domain. (Don't laugh! :)) No warranty, obviously.


#include <assert.h> /* assert() */
#include <stdlib.h> /* malloc(), free() */

unsigned dim; /* number of dimensions, "m" */
unsigned *base; /* base values for each dimension, "n1" ... "n_m" */

value_type **param_val; /* "value_type" stands in for the parameters' type */
/*
   for each j in [0..dim-1], param_val[j] has valid indices in
   [0 .. base[j]-1]
*/

/*
   A tuple is represented as an array of unsigned's; the lowest offset
   is the most significant.
*/

/* compute and cache the (exclusive) upper bound of the flat domain */
static long unsigned
get_maxflat(void)
{
    static long unsigned maxflat;

    if (0LU == maxflat) {
        unsigned i;

        maxflat = 1LU;
        for (i = 0u; i < dim; ++i) {
            maxflat *= base[i]; /* suppose no overflow */
        }
    }
    return maxflat;
}


/*
   convert flat to mixed-base:
   [0 .. maxflat-1] works as expected;
   for maxflat, the most significant position will equal base[0]
   (that is, the last carry is stored "in-line")
*/
static unsigned *
flat_to_tuple(long unsigned flat, unsigned *tuple)
{
    long unsigned divisor;
    unsigned i;

    divisor = get_maxflat();
    for (i = 0u; i < dim; ++i) {
        divisor /= base[i];        /* weight of number position i */
        tuple[i] = flat / divisor;
        flat %= divisor;
    }
    assert(0LU == flat);
    return tuple;
}


/* increment mixed-base */
static unsigned *
tuple_incr(unsigned *tuple)
{
    unsigned i;

    i = dim;
    for (;;) {
        --i;
        /*
           tuple[0] is allowed to grow to base[0], so that maxflat can
           be represented here too
        */
        if (++tuple[i] < base[i] || 0u == i) {
            break;
        }
        tuple[i] = 0u;
    }

    return tuple;
}


/* compare two tuples lexicographically */
static int
tuple_cmp(const unsigned *t1, const unsigned *t2)
{
    unsigned i;

    i = 0u;
    while (i < dim) {
        if (t1[i] < t2[i]) {
            return -1;
        }
        if (t1[i] > t2[i]) {
            return 1;
        }
        ++i;
    }
    return 0;
}


struct thr_arg
{
    unsigned *lo,
             *hi;
};


static void *
thrfunc(void *);

/* after "dim", "base" and "param_val" were initialized */
static void
spawn(unsigned numthr)
{
    struct thr_arg *args;
    unsigned thr;

    /* suppose no size_t or unsigned overflow */
    args = malloc(numthr * sizeof *args);
    assert(0 != args);

    for (thr = 0u; thr < numthr; ++thr) {
        long unsigned lo,
                      hi;

        /* naive */
        lo = get_maxflat() * thr / numthr; /* suppose no overflow */
        hi = get_maxflat() * (thr + 1) / numthr; /* here neither */

        args[thr].lo = malloc(dim * sizeof args[thr].lo);
        assert(0 != args[thr].lo);
        args[thr].hi = malloc(dim * sizeof args[thr].hi);
        assert(0 != args[thr].hi);

        flat_to_tuple(lo, args[thr].lo); /* inclusive */
        flat_to_tuple(hi, args[thr].hi); /* exclusive */

        /* too lazy to look up pthread_create()'s parameters */
        spawn_thread(&thrfunc, args + thr);
    }

    /*
       join all threads here and free the "args" array and the arrays
       owned by it
    */
}

static void *
thrfunc(void *arg)
{
    unsigned *lo,
             *hi;

    lo = ((struct thr_arg *)arg)->lo; /* inclusive */
    hi = ((struct thr_arg *)arg)->hi; /* exclusive */
    while (-1 == tuple_cmp(lo, hi)) {
        /*
           call the function with "lo", which will use param_val[0][lo[0]]
           to param_val[dim-1][lo[dim-1]]
        */
        (void)tuple_incr(lo);
    }
    return 0; /* nothing to report back to the joiner */
}
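
As a minimal sketch, the hypothetical spawn_thread() and the join/cleanup
step described in spawn()'s trailing comment could map onto POSIX threads
as follows. The "tids" array, the "tid_cnt" counter and join_all() are
illustrative helpers of mine, with "tids" assumed to be malloc'd to
numthr elements before spawn() runs:

#include <pthread.h>

static pthread_t *tids;  /* one slot per thread, allocated before spawn() */
static unsigned tid_cnt; /* next free slot; spawn() runs single-threaded */

static void
spawn_thread(void *(*func)(void *), void *arg)
{
    int rc;

    rc = pthread_create(tids + tid_cnt++, 0, func, arg);
    assert(0 == rc);
}

/* the join and the cleanup sketched in spawn()'s trailing comment */
static void
join_all(struct thr_arg *args)
{
    unsigned thr;

    for (thr = 0u; thr < tid_cnt; ++thr) {
        int rc;

        rc = pthread_join(tids[thr], 0);
        assert(0 == rc);
        free(args[thr].lo);
        free(args[thr].hi);
    }
    free(args);
}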


Here is a sample partitioning, generated by a slightly wrapped version of
the above, where "dim" ("m") equals 3 and the bases (n1, n2, n3) are
(2, 3, 4). The number of threads/partitions ("T") is 5.

There are 2*3*4 = 24 invocations to do. 24 is not divisible by 5, but
that is no problem; the partitioning is as evenly distributed as possible.

The "lo" flat index bound for a given thread/partition is always
inclusive, while the "hi" flat index bound is always exclusive.

---- thr=0 lo=0 hi=4
0 0 0
0 0 1
0 0 2
0 0 3

---- thr=1 lo=4 hi=9
0 1 0
0 1 1
0 1 2
0 1 3
0 2 0

---- thr=2 lo=9 hi=14
0 2 1
0 2 2
0 2 3
1 0 0
1 0 1

---- thr=3 lo=14 hi=19
1 0 2
1 0 3
1 1 0
1 1 1
1 1 2

---- thr=4 lo=19 hi=24
1 1 3
1 2 0
1 2 1
1 2 2
1 2 3

Cheers,
lacos

Ersek, Laszlo
Sep 6, 2010, 8:08:28 PM
On Tue, 7 Sep 2010, Ersek, Laszlo wrote:

> struct thr_arg
> {
> unsigned *lo,
> *hi;
> };

> args[thr].lo = malloc(dim * sizeof args[thr].lo);
> assert(0 != args[thr].lo);
> args[thr].hi = malloc(dim * sizeof args[thr].hi);
> assert(0 != args[thr].hi);

That's "sizeof *args[thr].lo" and ".hi", obviously. Sorry.

lacos

aminer
Sep 7, 2010, 12:00:28 AM

How are you, Mr. Hicham? :)

And don't forget these words and these poems of mine..

Don't forget about my new poems in Arabic, "El Ghorba wa el Firaaq"
(Exile and Separation):

http://pages.videotron.com/aminer/poems.html

If you read my poems and understand them very well,
they will make you smarter... and my poems are better
than parallelism, which can make you crazy :)

take care :)

Regards,
Amine Moulay Ramdane.

aminer
Sep 7, 2010, 12:10:38 AM

Hello Hicham,

And don't forget these words and the poems I have written..

Don't forget about my new poems in Arabic,
"El Ghorba wa el Firaaq" (Exile and Separation):

http://pages.videotron.com/aminer/poems.html

http://pages.videotron.com/aminer

If you read my poems carefully and understand them,
they will make you better and smarter.. they are better
than parallel programming, which can make you crazy, smile :)

take care.. :)

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer


James
Sep 7, 2010, 12:32:30 AM
"aminer" <ami...@videotron.ca> wrote in message
news:05145b48-1fe5-4bfe...@t20g2000yqa.googlegroups.com...
[...]

> If you read my poems carefully and understand them,
> they will make you better and smarter..

LOL!

[...]


Andrey Marochko
Sep 9, 2010, 6:46:35 AM
Doing this kind of partitioning by hand is tedious at best, and
error-prone as a rule. Instead, you could use one of the work-stealing
frameworks such as Intel TBB (http://www.threadingbuildingblocks.org), or
Cilk if you use the Intel compiler. E.g. with TBB you can simply replace
the usual for loops with parallel_for calls, and the library will do the
partitioning for you. Work stealing avoids oversubscription and keeps the
load well balanced across worker threads.
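
For example, a sketch of this approach in C++ (untested; it reuses the
flat-index idea and the get_maxflat()/flat_to_tuple() helpers from the
earlier post, and assumes a C++11 compiler for the lambda):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

/* flatten the index space and let TBB partition it among workers */
void run_all(void)
{
    tbb::parallel_for(
        tbb::blocked_range<long unsigned>(0LU, get_maxflat()),
        [](const tbb::blocked_range<long unsigned> &r) {
            unsigned *tuple = new unsigned[dim]; /* scratch, per subrange */

            for (long unsigned x = r.begin(); x != r.end(); ++x) {
                flat_to_tuple(x, tuple); /* decode the flat index */
                /* call some_function() with "tuple" here */
            }
            delete[] tuple;
        });
}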

clariss...@gmail.com
Oct 19, 2012, 3:05:23 PM
Many thanks