My thread pool class seems to be 5x faster than Qt's QThreadPool class...

Mr Flibble

Aug 29, 2017, 3:38:20 PM

My thread pool class seems to be 5x faster than Qt's QThreadPool class...

Qt QThreadPool test case:

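#include <QRunnable>
#include <QThreadPool>
#include <chrono>
#include <iostream>
#include <set>
#include <vector>

// Each task writes its own index into a shared vector.
struct task : public QRunnable
{
    int i;
    std::vector<int>& v;
    task(int i, std::vector<int>& v) : i{i}, v{v} {}

    void run() override
    {
        v[i] = i;
    }
};

int main(int argc, char *argv[])
{
    QThreadPool threadPool;
    std::vector<int> v;
    v.resize(100000);
    std::chrono::steady_clock::time_point begin =
        std::chrono::steady_clock::now();
    // The pool takes ownership of each task and deletes it after run(),
    // because QRunnable::autoDelete() is true by default.
    for (int i = 0; i < 100000; ++i)
        threadPool.start(new task{i, v});
    threadPool.waitForDone();
    std::chrono::steady_clock::time_point end =
        std::chrono::steady_clock::now();
    // Every task wrote a distinct value, so the set should hold 100000 entries.
    std::set<int> s;
    for (auto n : v)
        s.insert(n);
    std::cout << "\ns: " << s.size() << "\ntime: " <<
        std::chrono::duration_cast<std::chrono::milliseconds>(end -
        begin).count() << "ms" << std::endl;
    return 0;
}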

My neolib::thread_pool test case:

int main()
{
    neolib::thread_pool threadPool;
    std::vector<int> v;
    v.resize(100000);
    std::chrono::steady_clock::time_point begin =
        std::chrono::steady_clock::now();
    for (int i = 0; i < 100000; ++i)
        threadPool.run([i, &v]()
        {
            v[i] = i;
        });
    threadPool.wait();
    std::chrono::steady_clock::time_point end =
        std::chrono::steady_clock::now();
    std::set<int> s;
    for (auto n : v)
        s.insert(n);
    std::cout << "\ns: " << s.size() << "\ntime: " <<
        std::chrono::duration_cast<std::chrono::milliseconds>(end -
        begin).count() << "ms" << std::endl;
    return 0;
}

Timing results:
QThreadPool: 1051ms
neolib::thread_pool: 222ms

Not only is my thread pool class easier to use (lambdas instead of
deriving of classes from QRunnable) it also appears to have
significantly better performance.

Game on for neoGFX being serious competition for Qt... :D

/Flibble

Christian Gollwitzer

Aug 30, 2017, 12:58:54 AM

On 29.08.17 at 21:37, Mr Flibble wrote:

> Timing results:
> QThreadPool: 1051ms
> neolib::thread_pool: 222ms

Interesting. How about the compiler built-in solution:

#pragma omp parallel for schedule(dynamic)
// of course, schedule(guided) or schedule(static) will give much better
// performance, but schedule(dynamic) is closer to what you
// presumably do - start a thread for each loop iteration
for (...) { ...}

and compile with OpenMP (/openmp on VC++, -fopenmp on gcc)?
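
A minimal sketch of how the OpenMP suggestion might be applied to the same vector-fill benchmark. The helper names `fill_with_openmp` and `distinct_values` are hypothetical and not from either library; if the code is built without the OpenMP flag, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: fills v[i] = i with an OpenMP parallel loop and
// returns the elapsed wall-clock time in milliseconds.
long long fill_with_openmp(std::vector<int>& v)
{
    assert(!v.empty());
    const int n = static_cast<int>(v.size());
    const auto begin = std::chrono::steady_clock::now();
    // schedule(dynamic) hands iterations to the worker threads one at a
    // time, which is the closest match to submitting one task per index.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i)
        v[i] = i;
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
}

// Counts distinct values, mirroring the std::set check in the test cases.
std::size_t distinct_values(const std::vector<int>& v)
{
    return std::set<int>(v.begin(), v.end()).size();
}
```

Calling `fill_with_openmp` on a vector of 100000 elements and printing the result next to `distinct_values(v)` would give a figure comparable to the two timings above.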

> Not only is my thread pool class easier to use (lambdas instead of
> deriving of classes from QRunnable) it also appears to have
> significantly better performance.
>
> Game on for neoGFX being serious competition for Qt... :D

:D As soon as you have an HTML widget and a DB abstraction layer you can
try to compete....

Christian

Mr Flibble

Aug 30, 2017, 2:01:38 AM

If I want an HTML widget then it is a simple matter of making WebKit a
dependency, just like Qt does, but I disagree that I need to provide a
like-for-like equivalent of every single Qt feature to be able to compete.

/Flibble

David Brown

Aug 30, 2017, 2:43:16 AM

On 29/08/17 21:37, Mr Flibble wrote:
> My thread pool class seems to be 5x faster than Qt's QThreadPool class...
>

<snip>

>
> Not only is my thread pool class easier to use (lambdas instead of
> deriving of classes from QRunnable) it also appears to have
> significantly better performance.

How many threads does each version have in its pool? That might make a
difference.

In normal use, the functions passed to the thread pool are going to be
longer running (if not, then why bother with the threading?), so you are
not going to see such a big difference. Still, a 5x reduction in the
overhead is not insignificant.

For me, it is the use of lambdas (or presumably functors or anything
else that looks like a function) rather than a derived class that makes
your pool a nicer and more modern solution. That's not just doing the
same thing a bit faster - it is a big step up in ease of use.

Well done!

Paavo Helde

Aug 30, 2017, 7:10:04 AM

On 29.08.2017 22:37, Mr Flibble wrote:
> My thread pool class seems to be 5x faster than Qt's QThreadPool class...

> for (int i = 0; i < 100000; ++i)

> Timing results:
> QThreadPool: 1051ms
> neolib::thread_pool: 222ms

So the task launching overhead is roughly 10 µs with Qt and 2 µs with
neolib (1051 ms and 222 ms divided by 100,000 tasks). Not something
which I would lose sleep over, to be honest, but still interesting.
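
Dividing each total by the 100,000 tasks submitted gives the average per-task cost in microseconds. A quick check of that arithmetic follows; the helper name `per_task_us` is hypothetical, and the result is an upper bound on launch overhead because the totals also include the time the tasks themselves take.

```cpp
#include <cassert>

// Hypothetical helper: average cost per task in microseconds,
// given a total time in milliseconds and the number of tasks.
constexpr double per_task_us(double total_ms, int tasks)
{
    return total_ms * 1000.0 / tasks;
}

// per_task_us(1051.0, 100000) is about 10.51 (QThreadPool)
// per_task_us(222.0, 100000) is about 2.22 (neolib::thread_pool)
```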

Have you profiled this and found out the reasons? Is it because of
avoiding a virtual call and the resulting inlining, or is it because of
faster synchro primitives?

Mr Flibble

Aug 30, 2017, 10:58:01 AM

You failed to provide "better design" as a reason.

/Flibble

Chris M. Thomasson

Aug 30, 2017, 6:30:31 PM

Please try to forgive my ignorance, but at a brief glance, it seems the
lambda is faster in your version vs the "undercover" virtual call to run
in Qt? I am also thinking of massive contention wrt the underlying
memory allocator in the Qt version that uses an explicit call to new...

100,000 threads is extreme! However, seems to get a "point" across. Try
to create some array hybrids before creating shi%loads of linked
data-structures where each node needs an explicit call to new under the
contention of a loaded system! Use clean ingredients to make the damn
sausages! Not a heap of ground up crap from a 1,000+ different possibly
diseased cows trying to compress together in a coherent package.

> Not only is my thread pool class easier to use (lambdas instead of
> deriving of classes from QRunnable) it also appears to have
> significantly better performance.
>
> Game on for neoGFX being serious competition for Qt... :D

Sounds good to me. :^)

Sorry if my comments are misrepresenting you. ;^/

Mr Flibble

Aug 30, 2017, 6:58:59 PM

Nope, I also allocate a task object under the covers using new and it
has a virtual function.
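
One way a lambda-taking interface can still end up with a heap-allocated task object and a virtual call is sketched below. This is an illustration only, not neolib's actual code; `task_base`, `lambda_task` and `make_task` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical task interface, playing the role QRunnable plays in Qt.
struct task_base
{
    virtual ~task_base() = default;
    virtual void run() = 0;
};

// Wraps any callable so callers can pass a lambda instead of deriving a class.
template <typename F>
struct lambda_task final : task_base
{
    F fn;
    explicit lambda_task(F f) : fn(std::move(f)) {}
    void run() override { fn(); }
};

// One heap allocation per task (std::make_unique calls new internally),
// hidden from the caller behind a function template.
template <typename F>
std::unique_ptr<task_base> make_task(F f)
{
    return std::make_unique<lambda_task<F>>(std::move(f));
}
```

With this shape the per-task cost (one allocation plus one virtual call) is the same as in the QRunnable version; only the calling syntax differs.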

>
> 100,000 threads is extreme! However, seems to get a "point" across. Try

Nope, there are 100,000 tasks, not 100,000 threads; by default my
thread pool creates N threads where N is the number of CPU cores and I
believe QThreadPool does the same.
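
The distinction between tasks and threads can be made concrete with a minimal pool sketch: N worker threads (N defaulting to the reported core count) drain a shared queue of tasks. This is a generic illustration under stated assumptions, not the implementation of neolib::thread_pool or QThreadPool; the class name `simple_pool` is hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class simple_pool
{
public:
    explicit simple_pool(unsigned n = std::thread::hardware_concurrency())
    {
        if (n == 0) n = 1; // hardware_concurrency() may return 0 if unknown
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { work(); });
    }
    ~simple_pool()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        wake_.notify_all();
        for (auto& t : workers_) t.join();
    }
    simple_pool(const simple_pool&) = delete;
    simple_pool& operator=(const simple_pool&) = delete;

    // Queues a task; any number of tasks share the fixed set of threads.
    void run(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_.notify_one();
    }
    // Blocks until every queued task has finished running.
    void wait()
    {
        std::unique_lock<std::mutex> lock(m_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }
    std::size_t thread_count() const { return workers_.size(); }

private:
    void work()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                wake_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return; // stop requested and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so other workers can dequeue
            {
                std::lock_guard<std::mutex> lock(m_);
                if (--pending_ == 0) idle_.notify_all();
            }
        }
    }

    std::mutex m_;
    std::condition_variable wake_, idle_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    std::size_t pending_ = 0;
    bool stop_ = false;
};
```

Submitting 100,000 lambdas to such a pool creates 100,000 queue entries but never more than `thread_count()` threads. A single mutex-protected queue like this one is the simplest design and is likely to show noticeable contention in exactly the kind of tiny-task benchmark discussed here.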

> to create some array hybrids before creating shi%loads of linked
> data-structures where each node needs an explicit call to new under the
> contention of a loaded system! Use clean ingredients to make the damn
> sausages! Not a heap of ground up crap from a 1,000+ different possibly
> diseased cows trying to compress together in a coherent package.

Sorry I failed to parse/understand any of that.

>
>
>> Not only is my thread pool class easier to use (lambdas instead of
>> deriving of classes from QRunnable) it also appears to have
>> significantly better performance.
>>
>> Game on for neoGFX being serious competition for Qt... :D
>
> Sounds good to me. :^)
>
> Sorry if my comments are misrepresenting you. ;^/

Your comments are erroneous if that is what you mean.

/Flibble

Chris M. Thomasson

Aug 30, 2017, 7:53:57 PM

Are these calls to raw new? No special override to a custom allocator?

>> 100,000 threads is extreme! However, seems to get a "point" across. Try
>
> Nope, there are 100,000 tasks not a 100,000 threads; by default my
> thread pool creates N threads where N is number of CPU cores and I
> believe QThreadPool does the same.

100,000 individual async state machines multiplexed over N threads is
workable, and can be scalable.

>> to create some array hybrids before creating shi%loads of linked
>> data-structures where each node needs an explicit call to new under
>> the contention of a loaded system! Use clean ingredients to make the
>> damn sausages! Not a heap of ground up crap from a 1,000+ different
>> possibly diseased cows trying to compress together in a coherent package.
>
> Sorry I failed to parse/understand any of that.

Try to create a nice-sized single allocation of properly aligned and
padded memory before any threads and/or tasks are created. This can help
reduce some calls to new during a task's lifetime. The alignment and
padding can help get rid of false sharing.
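
A minimal sketch of that idea, assuming a 64-byte cache line (common on x86-64 but not universal) and a C++17 compiler so that std::vector honours the over-aligned type. The names `cache_line`, `padded_slot` and `make_slots` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed cache line size; adjust for the target hardware.
constexpr std::size_t cache_line = 64;

// alignas rounds sizeof up to a multiple of the alignment, so each slot
// occupies its own cache line and neighbouring slots never share one.
struct alignas(cache_line) padded_slot
{
    int value = 0;
};

static_assert(sizeof(padded_slot) == cache_line, "one slot per cache line");
static_assert(alignof(padded_slot) == cache_line, "slots start on a line boundary");

// One up-front allocation for all slots, made before any thread or task exists.
std::vector<padded_slot> make_slots(std::size_t count)
{
    assert(count > 0);
    return std::vector<padded_slot>(count);
}
```

Each worker would then write only to its own slot, so two threads updating adjacent slots do not invalidate each other's cache lines. The cost is memory: every `int` here takes 64 bytes.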

>>> Not only is my thread pool class easier to use (lambdas instead of
>>> deriving of classes from QRunnable) it also appears to have
>>> significantly better performance.
>>>
>>> Game on for neoGFX being serious competition for Qt... :D
>>
>> Sounds good to me. :^)
>>
>> Sorry if my comments are misrepresenting you. ;^/
>
> Your comments are erroneous if that is what you mean.

Did I do any better in this message, or worse? Trying to understand why
you got your speed up. Better use of data-structures wrt alignment,
layout and logic?

Chris M. Thomasson

Aug 30, 2017, 8:14:35 PM

I should explicitly point out that the pre-thread/task allocation would be
used for the task structures themselves and for allocation needs during
their lifetime. When this pre-allocated pool runs out, another call
to new can be performed to allocate another large chunk of memory. We
amortize calls to new.
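
The chunked scheme described above can be sketched as a simple bump allocator: storage is carved out of a large chunk, and a new chunk is allocated only when the current one is exhausted. This is an illustration only; `chunk_arena` is a hypothetical name, it is not thread-safe as written, and individual allocations are never freed (all storage is released when the arena is destroyed).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hypothetical bump allocator that amortizes calls to new over many
// small allocations.
class chunk_arena
{
public:
    explicit chunk_arena(std::size_t chunk_bytes) : chunk_bytes_(chunk_bytes) {}

    // Returns storage for `bytes` bytes aligned to `align`, which must be a
    // power of two no larger than alignof(std::max_align_t).
    void* allocate(std::size_t bytes, std::size_t align = alignof(std::max_align_t))
    {
        assert(bytes <= chunk_bytes_);
        // Round the current offset up to the requested alignment.
        std::size_t offset = (used_ + align - 1) & ~(align - 1);
        if (chunks_.empty() || offset + bytes > chunk_bytes_)
        {
            // Current chunk is exhausted: this is the only call to new.
            chunks_.push_back(std::make_unique<unsigned char[]>(chunk_bytes_));
            offset = 0;
        }
        used_ = offset + bytes;
        return chunks_.back().get() + offset;
    }

    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::size_t chunk_bytes_;
    std::size_t used_ = 0;
    std::vector<std::unique_ptr<unsigned char[]>> chunks_;
};
```

Objects are constructed in the returned storage with placement new, for example `new (arena.allocate(sizeof(int), alignof(int))) int(42)`. With a 1024-byte chunk, 256 four-byte allocations fit in the first chunk and the 257th triggers the second call to new.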