
Parallel.Invoke


Clive Tooth

Mar 12, 2012, 12:18:24 PM
I have an array of Actions which I kick off with a Parallel Invoke...

Action[] workers = new Action[numThreads];
...
Parallel.Invoke(workers);

Can I have an array of Action<int>s which I somehow kick off with an
array of parameters?

Action<int>[] workers = new Action<int>[numThreads];
...
Parallel.Invoke(workers, arrayOfParameters); // ... or somesuch

?

TIA

--
Clive Tooth

Peter Duniho

Mar 12, 2012, 1:45:14 PM
You could use Parallel.For(), capturing the array and using the For() index
to retrieve the parameter value?

For example:

int[] array = { 1, 1, 2, 3, 5, 8, 13 };
Action<int>[] actions = ...;

Parallel.For(0, array.Length, i => actions[i](array[i]));

However, note that if you are constructing new Action<int> instances for
each index anyway, then you could have just embedded the correct argument
for each instance in the delegate instance itself (i.e. through variable
capturing).

Doing it in the way shown above seems like superfluous work to me.

A more maintainable design would be to have just a single delegate instance
that works according to an integer value passed in. This can either be
just an index:

Parallel.For(0, array.Length, i => action(array[i]));

or the actual integer value from your original array:

Parallel.ForEach(array, i => action(i));

I think either of those approaches is significantly simpler and more
maintainable than making each delegate that you invoke actually be
different.
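[For readers following along, here is a minimal runnable sketch of the
single-delegate approach; the squaring action and the ConcurrentBag of
results are hypothetical stand-ins for whatever the real workers do.]

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class SingleDelegateDemo
{
    static void Main()
    {
        int[] array = { 1, 1, 2, 3, 5, 8, 13 };

        // One shared delegate, parameterized by the value passed in;
        // ConcurrentBag makes the concurrent writes thread-safe.
        var results = new ConcurrentBag<int>();
        Action<int> action = n => results.Add(n * n);

        // Each array element becomes one argument to the shared delegate.
        Parallel.ForEach(array, i => action(i));

        Console.WriteLine(results.Count); // 7
    }
}
```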

Pete

Marcel Müller

Mar 12, 2012, 5:59:48 PM
Feel free to inject your parameters into the workers.

Action<int>[] workers = new Action<int>[numThreads];
...
Action[] workers2 = new Action[numThreads];
for (int i = 0; i < numThreads; ++i)
    workers2[i] = () => workers[i](arrayOfParameters[i]);

Parallel.Invoke(workers2);


Marcel

Peter Duniho

Mar 12, 2012, 7:14:10 PM
Careful. The above will crash. The variable "i" that's being capture will,
by the even the first delegate invocation is attempted, have the value of
"numThreads", causing an IndexOutOfRangeException when dereferencing the
"workers" array (and would when dereferencing the "arrayOfParameters"
array, if the first exception didn't happen...or maybe it's the other way
around; I can't recall off the top of my head for sure).

If you're going to do it that way (it's equivalent to my second suggestion:
"[embed] the correct argument for each instance in the delegate instance
itself (i.e. through variable capturing)"), you need to write correct code:

for (int i = 0; i < numThreads; i++)
{
    int j = i;

    workers2[i] = () => workers[j](arrayOfParameters[j]);
}

Then each delegate instance captures a new "j" instance, which only ever
has a single value: the value that "i" had at the time of capture.
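[A self-contained sketch of the corrected pattern; the worker bodies
and parameter values here are hypothetical placeholders.]

```csharp
using System;
using System.Threading.Tasks;

class CaptureDemo
{
    static void Main()
    {
        int numThreads = 4;
        int[] arrayOfParameters = { 10, 20, 30, 40 };

        var workers = new Action<int>[numThreads];
        for (int k = 0; k < numThreads; k++)
            workers[k] = p => Console.WriteLine(p);

        var workers2 = new Action[numThreads];
        for (int i = 0; i < numThreads; i++)
        {
            // Fresh variable per iteration: each closure captures its
            // own copy instead of the shared loop counter "i".
            int j = i;
            workers2[i] = () => workers[j](arrayOfParameters[j]);
        }

        Parallel.Invoke(workers2); // prints 10, 20, 30, 40 in some order
    }
}
```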

Pete

Peter Duniho

Mar 12, 2012, 8:10:57 PM
Sorry...I just have to fix these really awful typos:

On Mon, 12 Mar 2012 16:14:10 -0700, Peter Duniho wrote:

> [...] The variable "i" that's being capture will,
> by the even the first delegate invocation is attempted...

Should be:

The variable "i" that's being captured will, by the time even the first
delegate invocation is attempted...

I think it was clear enough in the original form, and I could have put up
with the missing "d", but I can't stand having left out an entire word! :(

Marcel Müller

Mar 13, 2012, 4:54:09 AM
On 13.03.2012 00:14, Peter Duniho wrote:
> If you're going to do it that way (it's equivalent to my second suggestion:
> "[embed] the correct argument for each instance in the delegate instance
> itself (i.e. through variable capturing)"), you need to write correct code:
>
> for (int i = 0; i < numThreads; i++)
> {
>     int j = i;
>
>     workers2[i] = () => workers[j](arrayOfParameters[j]);
> }
>
> Then each delegate instance captures a new "j" instance, which only ever
> has a single value: the value that "i" had at the time of capture.

You are right! I fell into that common pitfall.

Usually I do not use for (int i...) for this purpose, but a generalized
sequence join as a LINQ expression. But that is not part of the
standard runtime.


Marcel

Clive Tooth

Mar 13, 2012, 7:33:12 AM
Thanks for that, Peter and Marcel. I have changed the program slightly
to use variable capture.

Here is some background to what I am actually doing...

Over 40 years ago I became interested in the problem of finding powers
of two which have runs of several consecutive zeros in their decimal
representation. I had access to a Burroughs B6500 computer and I wrote
a rather complicated Algol program to look for a run of 9 consecutive
zeros. I found them in 2^100,823. Many further runs of the program
failed to get as far as 2^559,940 which contains 10 consecutive zeros.
Over the decades I have revisited the problem a few times.

See
https://oeis.org/A006889

Anyway, a month or so ago I got a new computer... Hmmm... what can I
do with it? :)

Currently it is known that 2^918,583,174 has a run of 17 consecutive
zeros. So I thought I would have a look for 18 zeros.

My program stores the current power of two in base one billion - with
each base one billion "digit" being a 32-bit integer. I will refer to
these base billion digits as ints. Each int is equivalent to nine base
ten digits. Currently the program is up to about 2^922,000,000 - which
requires about 31 million ints. The basic loop is to double the
current number and then look to see if there is a run of 18 decimal
zeros. This process is facilitated by the fact that if there is a run
of 18 (or even 17) decimal zeros then there must be a zero int.

The program partitions the 31 million ints into n (nearly) equal
chunks and Parallel.Invokes a worker for each chunk. [n is, in fact,
eight. The computer has 8 logical processors (thanks to
hyper-threading) but I arrived at this value for n by
experimentation.] Each worker doubles its chunk and looks out for zero
ints. The main program then resolves the issue of the carry out of the
top of one chunk into the bottom of the next chunk and has a closer
look around the zero ints to see if we have our 18 zero decimal digits
yet. Every 20 minutes the program produces a restart file (using
serialization).

Before producing a restart file, and after restarting from one, the
program divides 2^currentpower by a ten-digit prime, P, and stores the
remainder. It then works out what 2^currentpower mod P should be [this
can be done quickly] and compares this answer with the remainder. If
those two numbers are different we have a problem.
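[A sketch of that consistency check, assuming little-endian base-billion
limbs. The thread does not give the actual prime P, so 1,000,000,007 is
used here purely as an example ten-digit prime.]

```csharp
using System;
using System.Numerics;

class Check
{
    const long Base = 1_000_000_000;  // base-billion limbs
    const long P = 1_000_000_007;     // example ten-digit prime

    // Remainder of the stored big number mod P, scanning limbs from
    // most significant to least significant: r = r * Base + limb (mod P).
    static long Remainder(int[] limbs)
    {
        BigInteger r = 0;
        for (int i = limbs.Length - 1; i >= 0; i--)
            r = (r * Base + limbs[i]) % P;
        return (long)r;
    }

    static void Main()
    {
        // 2^40 = 1,099,511,627,776 as limbs, least significant first.
        int[] limbs = { 511_627_776, 1_099 };

        // "What 2^n mod P should be" via fast modular exponentiation,
        // which takes O(log n) multiplies -- hence "quickly".
        long expected = (long)BigInteger.ModPow(2, 40, P);

        Console.WriteLine(Remainder(limbs) == expected); // True
    }
}
```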

At the moment the program is running at about 28 powers of two per
second. I will probably need a few years to find 18 zeros.

--
Clive Tooth

Peter Duniho

Mar 13, 2012, 12:52:13 PM
On Tue, 13 Mar 2012 11:33:12 +0000, Clive Tooth wrote:

> [...] [n is, in fact,
> eight. The computer has 8 logical processors (thanks to
> hyper-threading) but I arrived at this value for n by
> experimentation.]

Just a tip: if you picked your concurrency value empirically, then you have
probably avoided this issue. However, note that on some Hyperthreading
processors, the virtual cores share a cache, and parallel performance can
actually be drastically _worse_ than serial performance, due to multiple
threads accessing the same cache. This is especially easy to run into on
highly parallelizable computational problems such as yours.

I ran into this several years ago, not long after Intel first started using
HT. The issue occurred because my multiple threads kept working data on
the stack, and each thread had such similar execution profiles that the
stack-bound data wound up in the same place in the cache. Since cache
lines are determined by the modulo of the memory address for the data, and
since each thread started with their stack at a memory address that was an
even multiple of the cache size, each thread was using data that wound up
in the same place in the cache.

This conflict resulted in an enormous number of cache misses, because
each thread was constantly invalidating the cache lines that had been
initialized by the other.

As long as you've empirically determined that your concurrent performance
appears to be a rough scaling of the serial performance (i.e. with 8
threads, you're getting something reasonably close to 8x throughput), then
you've avoided this issue. In fact, while I haven't studied the newer HT
processors closely, I have read that Intel actually has worked to side-step
the issue or at least mitigate it. It might be harder to even reproduce on
the modern HT processors. When I ran into the issue, it was on a PC with
one of the original HT CPUs.

But I figured I'd mention it, just in case. :)

Pete

Clive Tooth

Mar 13, 2012, 7:53:23 PM
On Tue, 13 Mar 2012 09:52:13 -0700, Peter Duniho
<NpOeS...@NnOwSlPiAnMk.com> wrote:

>As long as you've empirically determined that your concurrent performance
>appears to be a rough scaling of the serial performance (i.e. with 8
>threads, you're getting something reasonably close to 8x throughput), then
>you've avoided this issue. In fact, while I haven't studied the newer HT
>processors closely, I have read that Intel actually has worked to side-step
>the issue or at least mitigate it. It might be harder to even reproduce on
>the modern HT processors. When I ran into the issue, it was on a PC with
>one of the original HT CPUs.

With 8 threads the program handles about 30 powers of two per second,
and each power of two would contain about 280 million decimal digits.
So, the metric that I use is "picoseconds per digit". The program
computes and displays various statistics every three seconds.

Here is a table of threads against ps/digit:

Th. ps/digit
1 633
2 318
4 193
7 125
8 114
9 162

So, using 8 threads is about 5.6 times faster than using 1 thread.

I now have a compile-time option to use a traditional threading
approach or the Parallel.Invoke method. In terms of speed they are
virtually identical. The whole program runs at "Below Normal"
priority.

--
Clive Tooth

Peter Duniho

Mar 13, 2012, 9:37:45 PM
On Tue, 13 Mar 2012 23:53:23 +0000, Clive Tooth wrote:

> [...]
> Here is a table of threads against ps/digit:
>
> Th. ps/digit
> 1 633
>
> 2 318
>
> 4 193
>
> 7 125
> 8 114
> 9 162
>
> So, using 8 threads is about 5.6 times faster than using 1 thread.

What happened to threads 3 and 5?

Also, based on those numbers it appears that at least two threads are
lagging behind. I suppose in one of those threads, it might be because
that's the one that is inspecting a value more closely, when it's flagged
by another thread as a potential candidate. But that still leaves the
other thread.

I note in your previous description that you have partitioned the work by
digits. It seems to me that would be a reasonable design if you just
wanted to process a single number as quickly as possible. But in this
case, it seems to me that it needlessly complicates the logic and adds
inter-thread overhead.

I would expect throughput to improve noticeably if you just keep all the
processing for any given number (power of 2) in a single thread. Then each
thread can basically proceed in complete isolation practically indefinitely
(interrupted only by housekeeping duties such as saving state and printing
statistics), rather than having to synchronize with the other threads
dozens of times each second.

Pete

Clive Tooth

Mar 14, 2012, 6:47:48 AM
Hi Pete,

Here is a more complete table of threads/ps-per-digit...

Th. ps/digit
1 633
2 318
3 223
4 193
5 165
6 142
7 125
8 114
9 162

Thanks *very* much for your idea of keeping all the processing for a
given power of two in just one thread - I will certainly pursue this.
That idea just had not occurred to me :(

When I thought about your idea my first notion was to have 8 threads
and each thread multiplying its power of 2 by 2^8. Each base-billion
digit would just be bit shifted to do the multiplication but then I
realised that a (64-bit) div and mod would be required to compute each
carry - so that is probably not the way to go.
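[A sketch of the rejected idea, to make the cost concrete: multiplying
the limb array by 2^8 takes a 64-bit multiply plus a div/mod per limb to
split each product into digit and carry, versus a shift and a compare
for plain doubling. Names here are hypothetical.]

```csharp
using System;

class Shift8
{
    const long Base = 1_000_000_000;

    // Multiply a little-endian base-billion number by 2^8 in place;
    // returns the final carry out of the top limb. The div/mod pair
    // per limb is the cost the doubling loop avoids.
    static long MulBy256(int[] a)
    {
        long carry = 0;
        for (int i = 0; i < a.Length; i++)
        {
            long t = (long)a[i] * 256 + carry;
            a[i] = (int)(t % Base);  // low 9 decimal digits
            carry = t / Base;        // carry into the next limb
        }
        return carry;
    }

    static void Main()
    {
        // 999,999,999 * 256 = 255,999,999,744
        int[] a = { 999_999_999, 0 };
        Console.WriteLine(MulBy256(a));       // 0
        Console.WriteLine($"{a[1]} {a[0]}");  // 255 999999744
    }
}
```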

A better idea is probably for each thread to work on, for example, one
million consecutive powers of two. When it had finished its range
(which would probably take 2 or 3 days) it would be told the starting
point for its next million. To work out the base-billion value of its
starting power of two the thread will communicate with the Mathematica
kernel running on my computer. That computation only takes Mathematica
about 5 minutes, so it is not too great an overhead.

The controlling thread will still compute an overall
picoseconds-per-digit figure, so we can see if the new method is an
improvement :)

--
Clive Tooth

Clive Tooth

Mar 14, 2012, 6:04:50 PM
On Wed, 14 Mar 2012 10:47:48 +0000, Clive Tooth <cli...@gmail.com>
wrote:

>A better idea is probably for each thread to work on, for example, one
>million consecutive powers of two. When it had finished its range
>(which would probably take 2 or 3 days) it would be told the starting
>point for its next million. To work out the base-billion value of its
>starting power of two the thread will communicate with the Mathematica
>kernel running on my computer. That computation only takes Mathematica
>about 5 minutes, so it is not too great an overhead.
>
>The controlling thread will still compute an overall
>picoseconds-per-digit figure, so we can see if the new method is an
>improvement :)

I tried giving out distinct ranges of powers of two to different
threads but the timings were spookily similar to my original
approach...

8 threads: originally 114 ps/digit - new method: 113 ps/dig
7 threads: originally 125 ps/digit - new method: 124 ps/dig

So, I don't think I will use this new method. [And keeping track of
which chunks have been processed would need some careful book-keeping.
We don't want to miss a chunk :) ]

However, I have managed to shave about 10ps off the 114ps by doing
something else...

Here is the inner loop of the program:

unchecked
{
    for (int i = startI; i < stopI; i++)
    {
        if (a[i] >= half)
        {
            a[i] = (a[i] << 1) - myBase | carry;
            carry = 1;
        }
        else
        {
            a[i] = (a[i] << 1) | carry;
            carry = 0;
        }
        if (a[i] == 0)
        {
            if (zeroCount >= HelpInfo.zerosMax)
            {
                U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
            }
            zeros[zeroCount] = i;
            ++zeroCount;
        }
    } // i loop
} // unchecked

It struck me that if I moved the "if (a[i] == 0) ..." to just after
the "a[i] = ..." the compiler might be able to do an optimisation
involving not re-fetching a[i]. And that does indeed seem to be the
case.

So the inner loop now reads...

unchecked
{
    for (int i = startI; i < stopI; i++)
    {
        if (a[i] >= half)
        {
            a[i] = (a[i] << 1) - myBase | carry;
            if (a[i] == 0)
            {
                if (zeroCount >= HelpInfo.zerosMax)
                {
                    U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
                }
                zeros[zeroCount] = i;
                ++zeroCount;
            }
            carry = 1;
        }
        else
        {
            a[i] = (a[i] << 1) | carry;
            if (a[i] == 0)
            {
                if (zeroCount >= HelpInfo.zerosMax)
                {
                    U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
                }
                zeros[zeroCount] = i;
                ++zeroCount;
            }
            carry = 0;
        }
    } // i loop
} // unchecked

And the time is down to about 104 picoseconds per digit.

--
Clive Tooth

Peter Duniho

Mar 14, 2012, 8:39:44 PM
On Wed, 14 Mar 2012 22:04:50 +0000, Clive Tooth wrote:

> On Wed, 14 Mar 2012 10:47:48 +0000, Clive Tooth <cli...@gmail.com>
> wrote:
>
>>A better idea is probably for each thread to work on, for example, one
>>million consecutive powers of two. When it had finished its range
>>(which would probably take 2 or 3 days) it would be told the starting
>>point for its next million. To work out the base-billion value of its
>>starting power of two the thread will communicate with the Mathematica
>>kernel running on my computer. That computation only takes Mathematica
>>about 5 minutes, so it is not too great an overhead.
>>
>>The controlling thread will still compute an overall
>>picoseconds-per-digit figure, so we can see if the new method is an
>>improvement :)
>
> I tried giving out distinct ranges of powers of two to different
> threads but the timings were spookily similar to my original
> approach...
>
> 8 threads: originally 114 ps/digit - new method: 113 ps/dig
> 7 threads: originally 125 ps/digit - new method: 124 ps/dig

Impossible to know without the full program. However, with such a small
difference in performance I suspect that you may still be synchronizing
threads too often. You mentioned before that "The controlling thread will
still compute an overall picoseconds-per-digit figure." If you are
updating those statistics too frequently (e.g. more than once every second
or, even better, once every few seconds), then you'll run into similar
synchronization overhead as before, resulting in similar timings as before.

> So, I don't think I will use this new method. [And keeping track of
> which chunks have been processed would need some careful book-keeping.
> We don't want to miss a chunk :) ]

Bookkeeping for the method I suggested should be simpler, not more
complicated. After all, you only need to assign a power of two to a
given chunk, whereas before you needed to keep track of which digits
were being processed by each thread for which power of two.

> [...]
> It struck me that if I moved the "if (a[i] == 0) ..." to just after
> the "a[i] = ..." the compiler might be able to do an optimisation
> involving not re-fetching a[i]. And that does indeed seem to be the
> case.

Likely even better would be to keep "a[i]" in a temp variable until the
end of the loop, avoiding the entire question of optimizations,
synchronization (which the CPU worries about even if you don't), etc.
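[Using the same names as the inner loop quoted earlier in the thread,
the temp-variable version might look like this; a sketch, not tested
against the real program.]

```csharp
// Read a[i] once into a local, compute the doubled digit, write it
// back once, and test the local rather than re-fetching the array slot.
unchecked
{
    for (int i = startI; i < stopI; i++)
    {
        int v = a[i];
        int d;
        if (v >= half)
        {
            d = (v << 1) - myBase | carry;
            carry = 1;
        }
        else
        {
            d = (v << 1) | carry;
            carry = 0;
        }
        a[i] = d;
        if (d == 0)
        {
            if (zeroCount >= HelpInfo.zerosMax)
            {
                U.GiveUp("Scanner problem: zeroCount >= HelpInfo.zerosMax");
            }
            zeros[zeroCount] = i;
            ++zeroCount;
        }
    } // i loop
} // unchecked
```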

Pete