About my ParallelFor() that scales very well

Horizon68

Jul 5, 2019, 5:50:55 PM
Hello,



This post is about my ParallelFor(), which scales very well because it
is built on my efficient Threadpool engine.

With ParallelFor() you have to consider two things:

1- Ensure sufficient work

Each iteration of a loop performs a certain amount of work, and that
work has to be large enough to outweigh the cost of scheduling it in
parallel. Read below about the "grainsize" parameter that I have
implemented for this.

2- Understand how iterations are scheduled

OpenMP offers static and dynamic scheduling. One basic characteristic
of a loop schedule is whether it is static or dynamic:

• In a static schedule, the choice of which thread performs a particular
iteration is purely a function of the iteration number and number of
threads. Each thread performs only the iterations assigned to it at the
beginning of the loop.

• In a dynamic schedule, the assignment of iterations to threads can
vary at runtime from one execution to another. Not all iterations are
assigned to threads at the start of the loop. Instead, each thread
requests more iterations after it has completed the work already
assigned to it.
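
As an illustration only, the difference between the two schedules can be
sketched in a few lines of Python. This is not OpenMP code and not code
from my Threadpool; the names static_owner and run_dynamic, the cyclic
static assignment and the chunk size of 4 are assumptions made for the
example.

```python
import threading


def static_owner(i, num_threads):
    # Static schedule: the owner is a pure function of the iteration
    # number and the number of threads (cyclic assignment assumed here).
    return i % num_threads


def run_dynamic(n_iterations, num_threads, chunk=4):
    # Dynamic schedule: an idle thread requests the next chunk, so the
    # mapping of iterations to threads can differ from run to run.
    lock = threading.Lock()
    next_start = 0
    done_by = {t: [] for t in range(num_threads)}  # thread -> iterations

    def worker(tid):
        nonlocal next_start
        while True:
            with lock:
                start = next_start
                if start >= n_iterations:
                    return
                next_start = min(start + chunk, n_iterations)
                end = next_start
            done_by[tid].extend(range(start, end))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done_by


print([static_owner(i, 3) for i in range(6)])  # prints [0, 1, 2, 0, 1, 2]
```

With run_dynamic(10, 3) every iteration from 0 to 9 is executed exactly
once, but which thread executes which chunk is decided at runtime.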


My ParallelFor() does not ask you to make that choice. It runs on my
efficient Threadpool, which uses round-robin scheduling and also work
stealing, and I think that this is sufficient.
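
To make the two terms concrete, here is a small single-threaded Python
simulation of round-robin distribution combined with work stealing. It
is only a sketch of the general technique, not the implementation inside
the Threadpool: the function names, the per-worker queues and the choice
to steal from the back of the victim's queue are all assumptions, and
the code is not thread-safe.

```python
from collections import deque


def distribute_round_robin(tasks, num_workers):
    # Round-robin: task number k goes to the queue of worker k mod N.
    queues = [deque() for _ in range(num_workers)]
    for k, task in enumerate(tasks):
        queues[k % num_workers].append(task)
    return queues


def next_task(queues, worker):
    # A worker takes from the front of its own queue first.
    if queues[worker]:
        return queues[worker].popleft()
    # Work stealing: if its own queue is empty, it takes a task from
    # the back of the first non-empty queue of another worker.
    for victim in range(len(queues)):
        if victim != worker and queues[victim]:
            return queues[victim].pop()
    return None  # no work left anywhere


queues = distribute_round_robin(list(range(5)), 2)
print([list(q) for q in queues])  # prints [[0, 2, 4], [1, 3]]
```

In this simulation, worker 1 first runs tasks 1 and 3 from its own
queue, and its next request steals task 4 from worker 0.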

Read the rest:

My Threadpool engine with priorities scales very well on multicore and
NUMA systems, and it comes with a ParallelFor() that also scales very
well on those systems.

You can download it from:

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well


Here is the explanation of my ParallelFor(). This is the declaration of
the method:

procedure ParallelFor(nMin, nMax: integer; aProc: TParallelProc;
  GrainSize: integer = 1; Ptr: pointer = nil;
  pmode: TParallelMode = pmBlocking;
  Priority: TPriorities = NORMAL_PRIORITY);

The nMin and nMax parameters are the minimum and maximum integer values
of the loop variable, the aProc parameter is the procedure to call, and
the GrainSize integer parameter works as follows:

The grainsize sets a minimum threshold for parallelization.

A rule of thumb is that grainsize iterations should take at least
100,000 clock cycles to execute.
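
The rule of thumb can be turned into a small calculation. The Python
sketch below is only an illustration; the name min_grainsize is
hypothetical, and the cycle count per iteration is something you have to
measure or estimate yourself.

```python
TARGET_CYCLES = 100_000  # rule-of-thumb minimum work per grainsize chunk


def min_grainsize(cycles_per_iteration):
    # Smallest number of iterations whose total cost reaches the target
    # (integer ceiling division, never less than one iteration).
    c = cycles_per_iteration
    return max(1, (TARGET_CYCLES + c - 1) // c)


print(min_grainsize(100))  # prints 1000
```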

For example, if a single iteration takes 100 clocks, then the grainsize
needs to be at least 1000 iterations. When in doubt, do the following
experiment:

1- Set the grainsize parameter higher than necessary. The grainsize is
specified in units of loop iterations.

If you have no idea of how many clock cycles an iteration might take,
start with grainsize=100,000.

The rationale is that each iteration normally requires at least one
clock cycle, so 100,000 iterations already satisfy the rule of thumb.
In most cases, step 3 will guide you to a much smaller value.

2- Run your algorithm.

3- Iteratively halve the grainsize parameter and see how much the
algorithm slows down or speeds up as the value decreases.
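
The three steps above can be automated. The following Python sketch
shows one possible way; tune_grainsize is a hypothetical helper, and
run(grainsize) stands for whatever code executes your parallel loop with
the given grainsize.

```python
import time


def tune_grainsize(run, start=100_000, min_grainsize=1):
    # Time run(grainsize) for start, start/2, start/4, ... and return
    # a list of (grainsize, seconds) pairs in the order they were tried.
    results = []
    g = start
    while g >= min_grainsize:
        t0 = time.perf_counter()
        run(g)
        results.append((g, time.perf_counter() - t0))
        if g == min_grainsize:
            break
        g = max(min_grainsize, g // 2)  # step 3: halve the grainsize
    return results


def fastest(results):
    # Pick the grainsize with the smallest measured time.
    return min(results, key=lambda r: r[1])[0]
```

Timings of a single run are noisy, so in practice it is safer to repeat
each measurement a few times before trusting the result.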

A drawback of setting a grainsize too high is that it can reduce
parallelism. For example, if the grainsize is 1000 and the loop has 2000
iterations, the ParallelFor() method distributes the loop across only
two processors, even if more are available.
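
The effect is easy to see by splitting the loop range into chunks of
grainsize iterations. The Python sketch below assumes inclusive bounds
and a simple left-to-right split; how the library actually divides the
range is not described here, so take split_range as a hypothetical
model only.

```python
def split_range(n_min, n_max, grainsize):
    # Split [n_min, n_max] (inclusive) into consecutive chunks of at
    # most grainsize iterations; each chunk can go to one worker.
    chunks = []
    start = n_min
    while start <= n_max:
        end = min(start + grainsize - 1, n_max)
        chunks.append((start, end))
        start = end + 1
    return chunks


print(len(split_range(1, 2000, 1000)))  # prints 2
print(len(split_range(1, 2000, 125)))   # prints 16
```

With a grainsize of 1000 there are only two chunks to hand out, so at
most two processors can be busy, whatever the size of the machine.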

You can pass a parameter to ParallelFor() as a pointer in Ptr. Set the
pmode parameter to pmBlocking so that ParallelFor() is blocking, or to
pmNonBlocking so that it is non-blocking. The Priority parameter is the
priority of ParallelFor(). Look inside the test.pas example to see how
to use it.
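
For readers who do not use Object Pascal, the Python sketch below
imitates the shape of such a call with the standard concurrent.futures
module. It is an analogy, not a port of the library: the function
parallel_for, the callback signature proc(i, ptr), the inclusive bounds
and the pool of 4 threads are assumptions, and priorities are left out.
See test.pas for the real usage.

```python
from concurrent.futures import ThreadPoolExecutor, wait

_pool = ThreadPoolExecutor(max_workers=4)  # pool size is an assumption


def _run_chunk(proc, start, end, ptr):
    # Run one chunk of consecutive iterations on a pool thread.
    for i in range(start, end + 1):
        proc(i, ptr)


def parallel_for(n_min, n_max, proc, grainsize=1, ptr=None, blocking=True):
    # Submit one task per chunk of at most grainsize iterations.
    futures = []
    start = n_min
    while start <= n_max:
        end = min(start + grainsize - 1, n_max)
        futures.append(_pool.submit(_run_chunk, proc, start, end, ptr))
        start = end + 1
    if blocking:
        wait(futures)        # like pmBlocking: return when all is done
        for f in futures:
            f.result()       # re-raise any exception from a chunk
    return futures           # like pmNonBlocking: caller waits later


def store_square(i, out):
    out[i] = i * i


squares = [0] * 10
parallel_for(0, 9, store_square, grainsize=3, ptr=squares)
print(squares)  # prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

In the non-blocking case the call returns at once, and the caller must
wait on the returned futures before reading the results.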




Thank you,
Amine Moulay Ramdane.

