
Split up vector for concurrent processing


bitrex

Oct 9, 2017, 9:28:46 PM
How would I go about accomplishing the following, which seems like
something one might want to do regularly in a multi-threaded
data-crunching application:

Take a vector of some type where the required algorithm can be applied
element-wise and doesn't depend on any of the other values, split into N
chunks (ideally N = number of cores * threads per core), send off copies
to the worker threads and then recombine the result in a new vector in
the same order after completion. Or use iterators to transform the
original vector in place from different threads, if that's possible?

red floyd

Oct 10, 2017, 12:22:26 AM
Run through the vector, and pass a reference (or pointer) to each
element to a new thread for processing?

PSEUDOCODE:

for (T& elem : v)
    spawn_thread(some_function, elem);


Christian Gollwitzer

Oct 10, 2017, 1:29:01 AM
On 10.10.17 at 03:28, bitrex wrote:
Use OpenMP. It does most of that for you:

std::vector<double> result(old.size());  // "new" is a keyword, so pick another name

// this pragma does almost exactly what you describe,
// except it doesn't copy the input vector
#pragma omp parallel for
for (size_t i = 0; i < old.size(); i++) {
    result[i] = old[i] * 2;
}

// caveat: some OpenMP implementations do not accept unsigned types
// then maybe replace size_t by intptr_t and ignore the
// comparison between signed and unsigned warning

Compile with openmp enabled (-fopenmp for gcc or /openmp for Visual C++)


Christian

red floyd

Oct 10, 2017, 3:23:20 AM
Much better than mine, assuming he has OpenMP.

Does OpenMP work on a single system? Or does it need to hand off to
another node in a cluster? I haven't looked at it in years.

David Brown

Oct 10, 2017, 4:32:48 AM
If you have C++17, you can try the "execution policies":

<http://en.cppreference.com/w/cpp/algorithm>

Jorgen Grahn

Oct 10, 2017, 4:41:37 AM
I think I'd try this design:

- Threads with an input queue and an output queue, like a Unix filter
except maybe without flow control.
- Pools of these.
- An abstraction which need not be thread-aware but can:
- chop up a container into N pieces
- accept "tagged" chunks of data, gather them into a destination
container, and flag "done" when it has all pieces matching the
source container. E.g. insert [10..12); insert [0..5); and lastly
insert [5..10) and then it's done because the original container
was [0..12) chopped up in three pieces.
- do this without too much copying

Although thinking a bit further, this is a bit like TCP: the sender
chops the stream up into segments, the receiver assembles them into a
stream, and preserves order.

A possibly infinite stream seems like a better abstraction for general
use for two reasons:
- You may want to process the first elements even if all of them aren't
ready yet.
- You'll have idle threads when you're near the end of the container;
utilization is lower than it perhaps could be.

Overkill for many uses, I'm sure.

Disclaimer: I don't do a lot of thread programming, and I didn't learn
the C++11 stuff.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Scott Lurndal

Oct 10, 2017, 8:40:20 AM
Use an autovectorizing compiler and the host SIMD instruction set?

bitrex

Oct 10, 2017, 9:43:36 AM
On 10/10/2017 09:14 AM, Stefan Ram wrote:
> bitrex <bit...@de.lete.earthlink.net> writes:
>> element-wise and doesn't depend on any of the other values, split into N
>> chunks (ideally N = number of cores * threads per core), send off copies
>> to the worker threads and then recombine the result in a new vector in
>
> Here is something similar (based on code by Bjarne Stroustrup):
> Calculate the sum of a vector in parallel.
>
> #include <algorithm>
> #include <future>
> #include <initializer_list>
> #include <iostream>
> #include <ostream>
> #include <thread>
> #include <utility>
> #include <vector>
>
> double sum( double const * const beginning, double const * const end )
> { return ::std::accumulate( beginning, end, 0.0 ); }
>
> double sum_in_parallel( ::std::vector< double > const & vector )
> { using task_type = double( double const *, double const * );
> ::std::packaged_task< task_type >package0{ sum };
> ::std::packaged_task< task_type >package1{ sum };
> ::std::future< double >future0{ package0.get_future() };
> ::std::future< double >future1{ package1.get_future() };
> double const * const p = &vector[ 0 ];
> { auto len { vector.size };
> ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; }
> return future0.get() + future1.get(); }
>
> int main()
> { ::std::vector const vector< double >{};
> ::std::cout << sum_in_parallel( vector )<< '\n'; }
>
> But I cannot get it compiled:
>
> error: variable 'std::packaged_task<double(const double*, const double*)> package0'
> has initializer but incomplete type
> ::std::packaged_task< task_type >package0{ sum };
> ^
> . Maybe someone can explain how to remove the error?
> (Is it my compiler not supporting all of the library?)
>

Nice!!

bitrex

Oct 10, 2017, 9:48:01 AM
I think recent versions of GCC should optimize for SIMD at -O3? In the
Code::Blocks build options I also see "CPU Architecture Tuning" flags
for AMD FX-64, Intel Core, etc...

Scott Lurndal

Oct 10, 2017, 10:20:43 AM
r...@zedat.fu-berlin.de (Stefan Ram) writes:
> Distribution through any means other than regular usenet
> channels is forbidden. It is forbidden to publish this
> article in the world wide web. It is forbidden to change
> URIs of this article into links. It is forbidden to remove
> this notice or to transfer the body without this notice.
>X-No-Archive: Yes
>Archive: no
>X-No-Archive-Readme: "X-No-Archive" is only set, because this prevents some
> services to mirror the article via the web (HTTP). But Stefan Ram
> hereby allows to keep this article within a Usenet archive server
> with only NNTP access without any time limitation.
>X-No-Html: yes
>Content-Language: en
>
>bitrex <bit...@de.lete.earthlink.net> writes:
>>element-wise and doesn't depend on any of the other values, split into N
>>chunks (ideally N = number of cores * threads per core), send off copies
>>to the worker threads and then recombine the result in a new vector in
>
> Here is something similar (based on code by Bjarne Stroustrup):
> Calculate the sum of a vector in parallel.
>
>#include <algorithm>
>#include <future>
>#include <initializer_list>
>#include <iostream>
>#include <ostream>
>#include <thread>
>#include <utility>
>#include <vector>
>
>double sum( double const * const beginning, double const * const end )
>{ return ::std::accumulate( beginning, end, 0.0 ); }
>
>double sum_in_parallel( ::std::vector< double > const & vector )
>{ using task_type = double( double const *, double const * );
> ::std::packaged_task< task_type >package0{ sum };
> ::std::packaged_task< task_type >package1{ sum };
> ::std::future< double >future0{ package0.get_future() };
> ::std::future< double >future1{ package1.get_future() };
> double const * const p = &vector[ 0 ];
> { auto len { vector.size };
> ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; }
> return future0.get() + future1.get(); }

Completely unreadable.

>
>int main()
>{ ::std::vector const vector< double >{};
> ::std::cout << sum_in_parallel( vector )<< '\n'; }
>
> But I cannot get it compiled:

Not surprising.

bitrex

Oct 10, 2017, 10:22:44 AM
On 10/10/2017 09:52 AM, Stefan Ram wrote:
> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>>> { ::std::vector const vector< double >{};
>> That should be
>> ::std::vector< double >const vector {};
>> . (But the error reported still remains.)
>
> Oh, and
>
> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 };
>
> should be
>
> ::std::thread thread1{ ::std::move( package1 ), p + len/2, p + len, 0 };
>
> . I was not able to start this program, so I was not able to
> debug it. But »move« helped me to spot the error, because
> when reading,
>
> ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 };
>
> , it is clear that one can move from »package0« only once.
>
>

Also the function that's packaged takes two doubles as arguments, but in
the "thread0" and "thread1" constructor the author is trying to pass
three not including the rvalue reference to the package. Also I don't
think using raw pointers as indexes into the vector data is such a good
idea.

This compiles with -std=c++11, recent versions of GCC but gives a
"terminate without active exception" on execution - looks like the
threads aren't being joined properly.

#include <future>
#include <initializer_list>
#include <iostream>
#include <thread>
#include <utility>
#include <vector>
#include <numeric>

double sum(double const* const beginning, double const* const end)
{
    return ::std::accumulate(beginning, end, 0.0);
}

double sum_in_parallel(const ::std::vector<double>& vector)
{
    using task_type = double(double const*, double const*);
    ::std::packaged_task<task_type> package0{sum};
    ::std::packaged_task<task_type> package1{sum};
    ::std::future<double> future0{package0.get_future()};
    ::std::future<double> future1{package1.get_future()};
    double const* const p = &vector[0];
    {
        auto len{vector.size()};
        ::std::thread thread0{::std::move(package0), p, p + len / 2};
        ::std::thread thread1{::std::move(package1), p + len / 2, p + len};
    }
    return future0.get() + future1.get();
}

int main()
{
    const ::std::vector<double> vector{1.0, 2.0, 3.0, 4.0,
                                       5.0, 6.0, 7.0, 8.0};
    ::std::cout << sum_in_parallel(vector) << ::std::endl;
}

bitrex

Oct 10, 2017, 10:23:10 AM
On 10/10/2017 10:22 AM, bitrex wrote:
> On 10/10/2017 09:52 AM, Stefan Ram wrote:
>> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>>> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>>>> { ::std::vector const vector< double >{};
>>>   That should be
>>> ::std::vector< double >const vector {};
>>>   . (But the error reported still remains.)
>>
>>    Oh, and
>>
>> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len,
>> 0 };
>>
>>    should be
>>
>> ::std::thread thread1{ ::std::move( package1 ), p + len/2, p + len,
>> 0 };
>>
>>    . I was not able to start this program, so I was not able to
>>    debug it. But »move« helped me to spot the error, because
>>    when reading,
>>
>> ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
>> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 };
>>
>>    , it is clear that one can move from »package0« only once.
>>
>>
>
> Also the function that's packaged takes two doubles as arguments

pointers to doubles, rather

David Brown

Oct 10, 2017, 10:45:01 AM
gcc (and other compilers) can do some auto-vectorising. But that is a
very different thing from multi-threading. Auto-vectorising means using
SIMD instructions to do multiple identical operations in parallel within
the one core. Multi-threading (as originally asked) means doing
possibly different operations in multiple threads, preferably on
multiple cores. Both techniques are useful.

To get the most out of auto-vectorising, you need to make sure the
compiler knows the cpu type it is targeting (such as with
"-march=native" if compiling for just your own cpu, but possibly other
flags if you want it to run on a variety of cpus in the same family).
You need -O2 or -O3 (or the compiler's equivalent). You may need other
flags as well. And you need to give your compiler as much information
as possible - try to make your loops of constant size, make data
suitably aligned for vectorisation (such as with gcc's "aligned"
attribute), etc.

Alain Ketterlin

Oct 10, 2017, 10:45:29 AM
bitrex <bit...@de.lete.earthlink.net> writes:

> How would I go about accomplishing the following, which seems like
> something one might want to do regularly in a multi-threaded
> data-crunching application:
>
> Take a vector of some type

"some type" is the crucial factor here.

> where the required algorithm can be applied element-wise and doesn't
> depend on any of the other values, split into N chunks (ideally N =
> number of cores * threads per core), send off copies to the worker
> threads

If you copy pieces of vectors, don't expect significant gains (unless
you have many cores): memory access is much more costly than mere
arithmetic. Also, simultaneous multi-threading (e.g., hyperthreading)
might be detrimental to performance (it all depends on the kind/amount
of data you process: SMT adds pressure on caches). It is very easy to
make a parallel version run slower than the sequential version.

> and then recombine the result in a new vector in the same order after
> completion. Or use iterators to transform the original vector in place
> from different threads, if that's possible?

Your best bet is OpenMP. Work in place as much as possible. For
parallel loops, adapt the scheduling strategy to the work (im)balance
(static if approximately balanced, dynamic otherwise) and to the array
size (for a static schedule, longer chunks are better). If you use
small chunks and you have small array elements, arrange for the chunks
to align on cache-line boundaries to avoid false sharing.

If you plan to, e.g., sum/... short vectors of int/float/..., give up on
multi-threads and ensure your compiler vectorizes properly; if necessary
rewrite your code so that it does (use whatever options your compiler
provides to spot the problems). Also make sure the compiler targets the
correct architecture (e.g., -march=native with gcc).

If instead you plan to, e.g., apply various filters to large raster
images of various sizes, use OpenMP (and still make sure your compiler
optimizes the sequential part correctly). Then play with scheduling
strategies.

-- Alain.

Christian Gollwitzer

Oct 10, 2017, 11:32:15 AM
On 10.10.17 at 09:23, red floyd wrote:
> On 10/09/2017 10:28 PM, Christian Gollwitzer wrote:
>>
>> std::vector<double> result(old.size());
>>
>> // this pragma does almost exactly what you describe,
>> // except it doesn't copy the input vector
>> #pragma omp parallel for
>> for (size_t i = 0; i < old.size(); i++) {
>>      result[i] = old[i] * 2;
>> }
>>
>> // caveat: some OpenMP implementations do not accept unsigned types
>> // then maybe replace size_t by intptr_t and ignore the
>> // comparison between signed and unsigned warning

> Much better than mine, assuming he has OpenMP.
>
> Does OpenMP work on a single system?  Or does it need to hand off to
> another node in a cluster?   I haven't looked at it in years.

OpenMP only works on shared memory systems, i.e. on a single node with
multiple CPUs. It is available in all major current C++ compilers (gcc,
clang, Intel, Visual). There used to be a discontinued product from
Intel (Cluster OpenMP) which used page faults to synchronize memory
across the cluster, but today clustering needs different tools (MPI is
the most standard one).

Christian



Scott Lurndal

Oct 10, 2017, 12:49:09 PM
For loosely coupled systems, openMPI is the typical answer.

David Brown

Oct 10, 2017, 1:02:40 PM
On 10/10/17 15:14, Stefan Ram wrote:
> bitrex <bit...@de.lete.earthlink.net> writes:
>> element-wise and doesn't depend on any of the other values, split into N
>> chunks (ideally N = number of cores * threads per core), send off copies
>> to the worker threads and then recombine the result in a new vector in
>
> Here is something similar (based on code by Bjarne Stroustrup):
> Calculate the sum of a vector in parallel.
>
> #include <algorithm>
> #include <future>
> #include <initializer_list>
> #include <iostream>
> #include <ostream>
> #include <thread>
> #include <utility>
> #include <vector>
>
> double sum( double const * const beginning, double const * const end )
> { return ::std::accumulate( beginning, end, 0.0 ); }
>
> double sum_in_parallel( ::std::vector< double > const & vector )
> { using task_type = double( double const *, double const * );
> ::std::packaged_task< task_type >package0{ sum };
> ::std::packaged_task< task_type >package1{ sum };
> ::std::future< double >future0{ package0.get_future() };
> ::std::future< double >future1{ package1.get_future() };
> double const * const p = &vector[ 0 ];
> { auto len { vector.size };
> ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; }
> return future0.get() + future1.get(); }
>
> int main()
> { ::std::vector const vector< double >{};
> ::std::cout << sum_in_parallel( vector )<< '\n'; }
>
> But I cannot get it compiled:
>
> error: variable 'std::packaged_task<double(const double*, const double*)> package0'
> has initializer but incomplete type
> ::std::packaged_task< task_type >package0{ sum };
> ^
> . Maybe someone can explain how to remove the error?
> (Is it my compiler not supporting all of the library?)
>

I have tried to keep the structure and logic of your code, while
removing the worst jumbled mess of formatting and the extra includes.
And it is crazy to call your std::vector instance "vector". (I hope you
don't teach your students that weird bracketing style, unusual spacing,
and unnecessary ::std. They are just going to have to unlearn it all
before working with any real-world code.)

Key errors:

1. Messed up type for "p"
2. Using "vector.size" instead of "vector.size()"
3. Extra parameter to your thread initialisers
4. Forgetting to join your threads
5. Using an empty vector for testing!


#include <numeric>
#include <vector>
#include <iostream>
#include <future>

static double sum(const double * const beginning, const double * const end)
{
    return std::accumulate(beginning, end, 0.0);
}

static double sum_in_parallel(const std::vector<double> &vect)
{
    using task_type = double(const double *, const double *);
    std::packaged_task<task_type> package0 { sum };
    std::packaged_task<task_type> package1 { sum };
    std::future<double> future0 { package0.get_future() };
    std::future<double> future1 { package1.get_future() };
    const double * p = &vect[0];
    const auto len { vect.size() };
    std::thread thread0 { std::move(package0), p, p + len / 2 };
    std::thread thread1 { std::move(package1), p + len / 2, p + len };
    thread0.join();
    thread1.join();
    return future0.get() + future1.get();
}

int main()
{
    const std::vector<double> vect { 1.0, 2.0, 3.0, 4.0 };
    std::cout << sum_in_parallel(vect) << '\n';
}



David Brown

Oct 10, 2017, 1:05:43 PM
On 10/10/17 19:02, David Brown wrote:

> Key errors:
>
> 1. Messed up type for "p"
Skip that one - I had merely missed out a "const" while copying the code.

red floyd

Oct 10, 2017, 1:06:41 PM
On 10/10/2017 9:48 AM, Scott Lurndal wrote:

> For loosely coupled systems, openMPI is the typical answer.
>

*THAT'S* the one I was thinking of. Thanks.

bitrex

Oct 10, 2017, 1:40:19 PM
Nice, thank you

bitrex

Oct 10, 2017, 1:44:41 PM
Thanks, I'll give OpenMP a look. If I don't have to re-invent the wheel
on this one that'd be great.

bitrex

Oct 10, 2017, 1:46:43 PM
On 10/10/2017 10:45 AM, Alain Ketterlin wrote:

> If instead you plan to, e.g., apply various filters to large raster
> images of various sizes, use OpenMP (and still make sure your compiler
> optimizes the sequential part correctly). Then play with scheduling
> strategies.
>
> -- Alain.
>

The use-case I had in mind was basically application of FIR filtering to
data chunks, yes

David Brown

Oct 10, 2017, 2:29:00 PM
FIR filtering is a somewhat specialised application. I would guess that
your biggest concern is cache - preloading the cache may make more
difference than running on multiple threads, and if you end up with
contention for data between two threads, it will be slower than a single
thread. But if you can split data nicely, then multiple threads can
work well here.

In any case, you should concentrate first on SIMD vectorisation as much
as possible, and only look at multiple threads if you need more speed
after that. You may even consider using a graphics card for the heavy
work.

Juha Nieminen

Oct 11, 2017, 2:52:50 AM
David Brown <david...@hesbynett.no> wrote:
> To get the most out of auto-vectorising, you need to make sure the
> compiler knows the cpu type it is targetting (such as with "-mnative" if
> compiling for just your own cpu, but possibly other flags if you want it
> to run on a variety of cpus in the same family). You need -O2 or -O3
> (or the compiler's equivalent). You may need other flags as well. And
> you need to give your compiler as much information as possible - try to
> make your loops of constant size, make data aligned suitable for
> vectorisation (such as with gcc's "aligned" attribute), etc.

Even the latest gcc and clang are rather poor at automatic SSE
vectorization. If you really want to get the most out of it, you'll
need to resort to non-portable extensions, e.g. the gcc SSE intrinsics.
(In my experience, optimizing a vectorizable calculation-heavy
operation manually with SSE intrinsics can more than double the speed
of calculation compared to just letting the compiler do it
automatically, no matter how you structure your code. Optimizing
intrinsics by hand results in significantly faster code than even using
OpenMP SIMD pragmas.)

David Brown

Oct 11, 2017, 7:55:47 AM
On 11/10/17 08:52, Juha Nieminen wrote:
> David Brown <david...@hesbynett.no> wrote:
>> To get the most out of auto-vectorising, you need to make sure the
>> compiler knows the cpu type it is targeting (such as with
>> "-march=native" if compiling for just your own cpu, but possibly other
>> flags if you want it to run on a variety of cpus in the same family).
>> You need -O2 or -O3 (or the compiler's equivalent). You may need other
>> flags as well. And you need to give your compiler as much information
>> as possible - try to make your loops of constant size, make data
>> suitably aligned for vectorisation (such as with gcc's "aligned"
>> attribute), etc.
>
> Even the latest gcc and clang are rather poor at automatic SSE vectorization.

In my testing, clang is more enthusiastic about SSE vectorisation than
gcc - often leading to inefficient code for a loop of n iterations when
n is small.

The compilers can do some vectorisation automatically, but they
certainly have their limits. The SSE engines in modern x86 chips come
in great variety, and it would be a tough task for a compiler to match
normal C code to these instructions. Simpler constructs, such as matrix
multiplication, should probably work (if you have the right types, the
right alignment, etc.).

> If you really want to get the most out of it, you'll need to resort to
> non-portable extensions, e.g. the gcc SSE intrinsics. (In my
> experience, optimizing a vectorizable calculation-heavy operation
> manually with SSE intrinsics can more than double the speed of
> calculation compared to just letting the compiler do it automatically,
> no matter how you structure your code. Optimizing intrinsics by hand
> results in significantly faster code than even using OpenMP SIMD
> pragmas.)
>

You can make the code at least somewhat portable by using the
declarations from "x86intrin.h". The same code should (in theory :-) )
work with gcc, clang, MSVC, and Intel's icc.

Jorgen Grahn

Oct 12, 2017, 1:39:47 PM
On Tue, 2017-10-10, David Brown wrote:
...
> #include <numeric>
> #include <vector>
> #include <iostream>
> #include <future>
>
> static double sum(const double * const beginning, const double * const end)
> {
>     return std::accumulate(beginning, end, 0.0);
> }
>
> static double sum_in_parallel(const std::vector<double> &vect)
> {
>     using task_type = double(const double *, const double *);
>     std::packaged_task<task_type> package0 { sum };
>     std::packaged_task<task_type> package1 { sum };
>     std::future<double> future0 { package0.get_future() };
>     std::future<double> future1 { package1.get_future() };
>     const double * p = &vect[0];
>     const auto len { vect.size() };
>     std::thread thread0 { std::move(package0), p, p + len / 2 };
>     std::thread thread1 { std::move(package1), p + len / 2, p + len };
>     thread0.join();
>     thread1.join();
>     return future0.get() + future1.get();
> }
>
> int main()
> {
>     const std::vector<double> vect { 1.0, 2.0, 3.0, 4.0 };
>     std::cout << sum_in_parallel(vect) << '\n';
> }

It's funny how strange it feels to read complete, clear, sensibly
formatted code on comp.lang.c++ ...

Christian Gollwitzer

Oct 12, 2017, 3:50:40 PM
On 12.10.17 at 19:39, Jorgen Grahn wrote:
It is even stranger to call this clear code when you can achieve the
same thing with

#pragma omp parallel for reduction(+:sum)
for (size_t i = 0; i < vect.size(); i++) { sum += vect[i]; }

Christian

David Brown

Oct 13, 2017, 4:46:55 AM
I don't think Stefan's example was intended to be minimal code for doing
a parallel sum. It was just an example of how to use packaged tasks and
threads in C++. And my code was a correction of his attempt.

OpenMP certainly makes some things easier for multi-threading - but
explicitly controlling the threads as in this sample can have its
advantages too. A tiny example like this will blow the code overheads
out of proportion - in a real world program, the difference will be much
smaller.