Re: [boost] [sort] Parallel sorting sub-library mini-review. Performance tests.


Christophe Henry

Nov 20, 2016, 3:24:50 PM11/20/16
to bo...@lists.boost.org
Hi Boost Community,

I have not yet managed to write a full review; one week is a bit short, so I
will need a few more days. First, I provide some performance tests.

To check the library's performance, I pitted it against my own versions of a
parallel mergesort and quicksort provided by the Asynchronous library (
https://github.com/henry-ch/asynchronous), which is now ready for review,
so we have a stable state for testing.
I also added parallel versions inside the Asynchronous library using
Francisco's single-thread implementations, to compare with his parallel ones.

I tested on an i7-5960X. Due to time limits, I could not test on more
interesting NUMA platforms (I did not get the reviewed library to compile
for my KNC), so the tests do not claim any validity outside the i7
architecture.
I linked with tbb_malloc_proxy to limit the influence of memory allocation.

In a nutshell.
I'm not a big fan of the parallel version of the algorithms. It seems to be
based on std::async, so that a lot of threads are started and joined at
every call. I would suggest using a threadpool.
OTOH the single-threaded ones are interesting, especially the stable_sort
and intro_sort for cases where usage of spreadsort is not possible.

Cheers,
Christophe


A short summary:

100000000 uint64_t elements already sorted:
OMP parallel sort : 0.390899 secs
Boost parallel sort : 0.064965 secs
OMP parallel stable sort : 1.06128 secs
Boost sample sort : 0.0695357 secs
Boost parallel stable sort : 0.0662401 secs

Asynchronous parallel sort : 0.0167134 secs (same with other
algorithms)

Asynchronous provides a special optimization for this case.


I added this one:
100000000 uint64_t elements reverse sorted

OMP parallel sort : 0.317039 secs
Boost parallel sort : 0.581381 secs

OMP parallel stable sort : 1.06448 secs
Boost sample sort : 0.524746 secs
Boost parallel stable sort : 0.73036 secs

Asynchronous parallel sort : 0.0478701 secs

Asynchronous provides a special optimization for this case.
I think the library should do this too. This case is pretty common and a
popular DoS attack.
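This optimization can be cheap to add. Here is a minimal single-threaded sketch (a hypothetical helper, not code from either library) of an O(n) pre-pass that handles both the already-sorted and the strictly reverse-sorted case before falling through to the real sort:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical O(n) pre-pass: returns true if the range was handled as a
// trivial case (already sorted, or strictly reverse-sorted and reversed
// in place), false if the real parallel sort should run.
template <typename Iter, typename Compare>
bool handle_trivial_cases(Iter first, Iter last, Compare comp) {
    if (std::is_sorted(first, last, comp))   // already sorted: nothing to do
        return true;
    // Strictly decreasing input: reversing restores order in O(n).
    // (Strictness keeps the transformation safe for stable sorts,
    // since no run of equal keys is reordered.)
    bool strictly_decreasing =
        std::adjacent_find(first, last,
                           [&](const auto& a, const auto& b) {
                               return !comp(b, a);  // pair NOT strictly decreasing
                           }) == last;
    if (strictly_decreasing) {
        std::reverse(first, last);
        return true;
    }
    return false;   // neither case: run the real parallel sort
}
```

The two checks cost one pass each, which is negligible next to an O(n log n) sort of 100000000 elements, and both passes themselves could be parallelised.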

100000000 uint64_t elements randomly filled
OMP parallel sort : 1.03594 secs
Boost parallel sort : 0.796447 secs

OMP parallel stable sort : 1.28601 secs
Boost sample sort : 0.818954 secs
Boost parallel stable sort : 1.13604 secs

Asynchronous parallel quickspreadsort: 0.587432 secs
Asynchronous quick_intro_sort : 0.728393 secs

This mix of a quicksort degrading into a spreadsort works best here. The
parallel adaptation of intro_sort is not bad either and is the best of the
other library algorithms.

Asynchronous parallel stable sort : 1.26141 secs
Asynchronous boost::stable sort : 0.804814 secs

Interesting. The stable version of the library easily beats the
std::stable_sort I used until now.


10000000 strings randomly filled

OMP parallel sort : 1.05803 secs
Boost parallel sort : 0.933055 secs

OMP parallel stable sort : 1.12269 secs
Boost sample sort : 0.889216 secs
Boost parallel stable sort : 1.56564 secs

Asynchronous parallel quickspreadsort: 0.788856 secs
Asynchronous quick_intro_sort : 0.893652 secs

Asynchronous parallel stable sort : 1.23495 secs
Asynchronous boost::stable sort : 1.21817 secs

Similar results.


Let's move on to big objects:

1562500 elements of size 512 randomly filled
H E A V Y C O M P A R I S O N

OMP parallel sort : 0.308923 secs
Boost parallel sort : 0.342992 secs

Asynchronous parallel_quicksort : 0.246709 secs
Asynchronous quick_intro_sort : 0.269666 secs

Both quicksorts are best, with a slight advantage to intro_sort.

The light comparison is similar.


My test code (the modified benchmark.cpp provided by the library):
https://github.com/henry-ch/asynchronous/blob/master/libs/asynchronous/test/perf/boost_sort/benchmark_sort.cpp

The full test results:
https://github.com/henry-ch/asynchronous/blob/master/libs/asynchronous/test/perf/boost_sort/Results_sort.txt

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Francisco José Tapia

Nov 21, 2016, 4:02:01 AM11/21/16
to bo...@lists.boost.org
About the internal implementation:


The internal implementation needs *threads which permit reuse of their
initialized variables*. This feature is not implemented in the C++ standard
as of today, and I did not see it in the thread pool implementations which I
examined.

I implemented this with threads and atomic variables. The threads are
created at the beginning of the program, where they initialize the thread
pool variables, which are destroyed with the thread. *The threads are
created only once.*

The parallel_for and parallel_while are implemented with atomic variables,
in a way that is short and easy to understand and debug.
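As an illustration of this technique (a minimal sketch under my own naming, not the library's actual code), a parallel_for can be driven by a single atomic counter from which each worker claims the next chunk with fetch_add:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sketch of an atomic-counter parallel_for (hypothetical helper): workers
// repeatedly claim [begin, begin + chunk) ranges until the data is exhausted.
void parallel_for(std::size_t n, std::size_t chunk,
                  const std::function<void(std::size_t, std::size_t)>& body,
                  unsigned nthreads = std::thread::hardware_concurrency()) {
    if (nthreads == 0) nthreads = 2;      // hardware_concurrency may report 0
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                std::size_t begin = next.fetch_add(chunk);  // claim a chunk
                if (begin >= n) break;                      // nothing left
                body(begin, std::min(begin + chunk, n));
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

The single fetch_add is the only synchronization point, which is what makes this style easy to debug compared to work-stealing schedulers.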

The work units to be executed by the threads are implemented as
std::function objects stored in a concurrent stack. The estimated running
time of each function object is a few milliseconds.

This internal design is highly efficient with many threads (I checked on
dual-socket servers with 32 and 48 threads), and the results compared with
GCC parallel sort are better with many threads.
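The described design, with threads created once that pull std::function jobs from a shared stack, can be sketched roughly as follows. This is a simplified illustration using a mutex and condition variable where the library uses atomics, and the names are my own:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <stack>
#include <thread>
#include <vector>

// Sketch of a pool whose threads are created exactly once and reused.
// The destructor drains remaining jobs, then joins the workers.
class worker_pool {
public:
    explicit worker_pool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~worker_pool() {
        { std::lock_guard<std::mutex> lk(mtx_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(mtx_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(mtx_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;     // shutting down, no work left
                job = std::move(jobs_.top());
                jobs_.pop();
            }
            job();                             // execute outside the lock
        }
    }
    std::mutex mtx_;
    std::condition_variable cv_;
    std::stack<std::function<void()>> jobs_;
    bool done_ = false;
    std::vector<std::thread> workers_;
};
```

With jobs sized at a few milliseconds each, the cost of the lock (or, in the real implementation, the atomic operations) is amortized over the work.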


About the goals of this library :

The first is to create a library that is *extremely easy to understand and
use*. (You only need to include boost/sort/parallel/sort.hpp.)

The library must be *independent of any other code*. (You can use it
separately; you only need to copy the boost/sort/parallel folder, and all
the needed code is included.)

The library must require *only a C++11-compliant compiler*. The code can run
on an embedded multi-core system or on a powerful processor with plenty of
cores in a big server.

The advantages of the algorithms are:

*parallel_sort, sample_sort*: this algorithm is commonly accepted as the
fastest parallel stable sort. The implementation is *notably faster than the
one provided by the GCC compiler* (based on OpenMP) and uses less memory.
TBB does not provide a parallel stable sort, and only has experimental code
on their web pages.

*parallel_sort*: the advantage of this algorithm, invented by me, is the
memory usage. *The speed is similar to GCC parallel sort, but THE MEMORY
USED IS HALF.*

The algorithms are *exception safe*, meaning that exceptions generated by
the algorithms guarantee the integrity of the objects to sort, but not their
relative order. If an exception is generated inside the objects (in the move
or copy constructor), the results can be unpredictable.


Thanks for your interest


Francisco


* I took a quick look at your library and saw many interesting things
inside.

Steven Ross

Nov 21, 2016, 9:42:50 PM11/21/16
to bo...@lists.boost.org
On Sun, Nov 20, 2016 at 3:24 PM Christophe Henry <
christoph...@gmail.com> wrote:

> I did not manage yet to write a full review. One week is a bit short so a
> few more days will be necessary. I provide some performance tests first.
>

Thanks Christophe. We will be interested in incorporating your feedback to
improve the library.


>
> OTOH the single-threaded ones are interesting, especially the stable_sort
> and intro_sort for cases where usage of spreadsort is not possible.
>

I challenge you to find a sorting problem where it isn't possible to use
spreadsort. string_sort can sort anything that std::sort can; you just
need to convert the sort key comparison sequence into a sequence of bytes
and feed string_sort the correct functors. That said, it may not be
efficient to sort some problems with string_sort, and it'll probably be too
much work to be worth the effort for many problems.
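To make the "key to bytes" idea concrete, here is a hypothetical helper (not part of the Boost.Sort API) that encodes a signed 64-bit key as a byte sequence whose lexicographic order matches the numeric order. Feeding such bytes to string_sort through its get-char functor is what the conversion amounts to:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Encode a signed 64-bit key so that byte-by-byte comparison of the result
// agrees with numeric comparison of the input: flip the sign bit so that
// negative values sort before positive ones, then emit big-endian bytes so
// the most significant byte is compared first.
std::array<unsigned char, 8> key_bytes(std::int64_t v) {
    std::uint64_t u = static_cast<std::uint64_t>(v) ^ (1ull << 63); // sign flip
    std::array<unsigned char, 8> out;
    for (int i = 0; i < 8; ++i)                 // big-endian: MSB first
        out[i] = static_cast<unsigned char>(u >> (56 - 8 * i));
    return out;
}
```

The same trick generalizes to composite keys by concatenating the per-field byte sequences in comparison order, which is the transformation Steven describes.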


>
> A short summary:
>
> 100000000 uint64_t elements already sorted:
> Boost parallel sort : 0.064965 secs
>
> Asynchronous parallel sort : 0.0167134 secs (same with other
> algorithms)
>
> Asynchronous provides a special optimization for this case.
>

If this optimization is practical to incorporate in Boost parallel sort, it
is a common case and your solution seems noticeably faster, so I would like
Francisco to consider it. That said, .06s isn't bad.


>
> I added this one:
> 100000000 uint64_t elements reverse sorted
>
> Boost parallel sort : 0.581381 secs
>
> Asynchronous parallel sort : 0.0478701 secs
>
> Asynchronous provides a special optimization for this case.
> I think the library should do it too. This one is pretty common and a
> popular DOS attack.
>

This optimization would be nice to have, if it's cheap.


>
> 100000000 uint64_t elements randomly filled
>
> Asynchronous parallel quickspreadsort: 0.587432 secs
> Asynchronous quick_intro_sort : 0.728393 secs
>
> This mixing of quicksort degrading into a spreadsort works here best.


I find this algorithm interesting; have you considered using MSD
radix-based sorting to break up the data for the threads in parallel? For
example, take T threads, have each thread take 1/T of the data and bucket
it into N MSD-based buckets (just like Spreadsort), where T < N <= ~2^11,
then merge the equivalent buckets, and as you merge them hand them off to
another thread to sort. This will probably divide better if you find the
max and min before bucketing using a parallel min/max search.
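A single-threaded sketch of the bucketing step proposed here (illustration only; the fixed bucket count and the sequential std::sort per bucket are my own simplifications, and the real scheme would bucket and sort on separate threads):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// MSD bucketing as described above: find min/max, split the key range on its
// most significant varying bits, sort each bucket, and concatenate. The
// result is globally sorted because the bucket index is monotone in the key.
void msd_bucket_sort(std::vector<std::uint64_t>& data) {
    if (data.size() < 2) return;
    auto mm = std::minmax_element(data.begin(), data.end());
    std::uint64_t mn = *mm.first, mx = *mm.second;
    if (mn == mx) return;                        // all keys equal: done
    std::uint64_t range = mx - mn;
    unsigned width = 0;                          // bits needed for the range
    for (std::uint64_t r = range; r != 0; r >>= 1) ++width;
    const unsigned log2_buckets = 8;             // up to 256 buckets
    unsigned shift = width > log2_buckets ? width - log2_buckets : 0;
    std::vector<std::vector<std::uint64_t>> buckets((range >> shift) + 1);
    for (std::uint64_t v : data)
        buckets[(v - mn) >> shift].push_back(v); // MSD bucketing
    std::size_t pos = 0;
    for (auto& b : buckets) {                    // each bucket could go to a thread
        std::sort(b.begin(), b.end());
        std::copy(b.begin(), b.end(), data.begin() + pos);
        pos += b.size();
    }
}
```

Finding min/max first, as Steven suggests, is what makes the buckets well balanced when the keys occupy only part of the 64-bit range.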


> 10000000 strings randomly filled
>
> OMP parallel sort : 1.05803 secs
> Boost parallel sort : 0.933055 secs
>
> OMP parallel stable sort : 1.12269 secs
> Boost sample sort : 0.889216 secs
> Boost parallel stable sort : 1.56564 secs
>
> Asynchronous parallel quickspreadsort: 0.788856 secs
> Asynchronous quick_intro_sort : 0.893652 secs
>
> Asynchronous parallel stable sort : 1.23495 secs
> Asynchronous boost::stable sort : 1.21817 secs
>
> Similar results.
>
>
> Let's move on to big objects:
>
> 1562500 elements of size 512 randomly filled
> H E A V Y C O M P A R I S O N
>
> OMP parallel sort : 0.308923 secs
> Boost parallel sort : 0.342992 secs
>
> Asynchronous parallel_quicksort : 0.246709 secs
> Asynchronous quick_intro_sort : 0.269666 secs
>
> Both quicksorts are best, with a slight advantage to intro_sort.
>
> The light comparison is similar.
>

Do we have your permission to copy and/or adapt your quick_intro_sort
algorithm into the Boost.Sort library? How is the worst-case performance
relative to Francisco's library?

Christophe Henry

Nov 22, 2016, 2:56:03 PM11/22/16
to bo...@lists.boost.org
Hi,

> The threads are created only 1 time*.

Have I missed something? I checked sample_sort and saw several sets of
calls to std::async, which are likely to create and destroy threads several
times.

>*parallel_sort, sample_sort* : this algorithm is commonly accepted as the
>fastest for the parallel stable sort.
>The implementation is *notoriously faster than the provided by the GCC compiler*
>( based on Open MP ) using less memory. TBB don't
>provide parallel stable sort, and only have an experimental code in their web pages

Indeed, I found it highly interesting.

Cheers,
Christophe

Christophe Henry

Nov 22, 2016, 3:23:53 PM11/22/16
to bo...@lists.boost.org
>>
>> OTOH the single-threaded ones are interesting, especially the stable_sort
>> and intro_sort for cases where usage of spreadsort is not possible.
>>

>I challenge you to find a sorting problem where it isn't possible to use
>spreadsort. string_sort can sort anything that std::sort can;
>you just need to convert the sort key comparison sequence into a sequence of bytes
>and feed string_sort the correct functors. That said, it may not be
>efficient to sort some problems with string_sort,
>and it'll probably be too much work to be worth the effort for many problems.

This is what I meant; sorry for the lack of precision.
I'll add it to my performance tests; it might be an interesting variant, as
the algorithm is really fast.

>>
>> A short summary:
>>
>> 100000000 uint64_t elements already sorted:
>> Boost parallel sort : 0.064965 secs
>>
>> Asynchronous parallel sort : 0.0167134 secs (same with other
>> algorithms)
>>
>> Asynchronous provides a special optimization for this case.
>>

>If this optimization is practical to incorporate in Boost parallel sort,
>it is a common case and your solution seems noticeably faster,
>so I would like Francisco to consider it. That said, .06s isn't bad.

Sure.

>>
>> I added this one:
>> 100000000 uint64_t elements reverse sorted
>>
>> Boost parallel sort : 0.581381 secs
>>
>> Asynchronous parallel sort : 0.0478701 secs
>>
>> Asynchronous provides a special optimization for this case.
>> I think the library should do it too. This one is pretty common and a
>> popular DOS attack.
>>

> This optimization would be nice to have, if it's cheap.

A parallel reverse is in O(n) so it's cheap compared to a sort.

>>
>> 100000000 uint64_t elements randomly filled
>>
>> Asynchronous parallel quickspreadsort: 0.587432 secs
>> Asynchronous quick_intro_sort : 0.728393 secs
>>
>> This mixing of quicksort degrading into a spreadsort works here best.


>I find this algorithm interesting;

Frankly, there is nothing special. It's simply a parallel partition, which
degrades into a "normal" parallel merge sort (in this case spreadsort)
after a while.
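A sequential sketch of this degradation scheme (in Christophe's version the partition step is what runs in parallel, and std::sort stands in here for the spreadsort or intro_sort fallback):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Quicksort for a bounded number of levels, then hand each remaining piece
// to a fallback sort. The three-way partition around the pivot is the step
// that can be parallelised.
template <typename Iter>
void quick_then_fallback(Iter first, Iter last, int depth) {
    if (last - first < 2) return;
    if (depth == 0) {                  // depth exhausted: degrade
        std::sort(first, last);        // stand-in for spreadsort / intro_sort
        return;
    }
    auto pivot = *(first + (last - first) / 2);   // copy before partitioning
    Iter mid1 = std::partition(first, last,
                               [&](const auto& v) { return v < pivot; });
    Iter mid2 = std::partition(mid1, last,
                               [&](const auto& v) { return !(pivot < v); });
    quick_then_fallback(first, mid1, depth - 1);  // elements < pivot
    quick_then_fallback(mid2, last, depth - 1);   // elements > pivot
}
```

After a few levels the pieces are numerous enough to keep all threads busy, so each one can be sorted independently by the fallback algorithm.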

>have you considered using MSD radix-based sorting to break up the data
> for the threads in parallel?
>For example, take T threads, have each thread take 1/T of the data
> and bucket it into N MSD-based buckets (just like Spreadsort),
>where T < N <= ~2^11, then merge the equivalent buckets,
>and as you merge them hand them off to another thread to sort.
>This will probably divide better if you find the max and min before bucketing
>using a parallel min/max search.

I admit I didn't. But it sounds really interesting and I think I'll have a
go at it.
I'm not a sort expert though. I approached this from the parallel
perspective.

>> 10000000 strings randomly filled
>>
>> OMP parallel sort : 1.05803 secs
>> Boost parallel sort : 0.933055 secs
>>
>> OMP parallel stable sort : 1.12269 secs
>> Boost sample sort : 0.889216 secs
>> Boost parallel stable sort : 1.56564 secs
>>
>> Asynchronous parallel quickspreadsort: 0.788856 secs
>> Asynchronous quick_intro_sort : 0.893652 secs
>>
>> Asynchronous parallel stable sort : 1.23495 secs
>> Asynchronous boost::stable sort : 1.21817 secs
>>
>> Similar results.
>>
>>
>> Let's move on to big objects:
>>
>> 1562500 elements of size 512 randomly filled
>> H E A V Y C O M P A R I S O N
>>
>> OMP parallel sort : 0.308923 secs
>> Boost parallel sort : 0.342992 secs
>>
>> Asynchronous parallel_quicksort : 0.246709 secs
>> Asynchronous quick_intro_sort : 0.269666 secs
>>
>> Both quicksorts are best, with a slight advantage to intro_sort.
>>
>> The light comparison is similar.
>>

>Do we have your permission to copy and/or adapt your quick_intro_sort
>algorithm into the Boost.Sort library?

Please feel free to copy whatever might be useful. The library has a Boost
license, as it's intended for inclusion in Boost. I also took the liberty of
adding the single-thread versions of Francisco's algorithms as the basis for
my parallel versions.
The algorithm is quite simple. It's the same as quick_spreadsort: a
quicksort which degrades into Francisco's intro_sort after some iterations.
I just found it interesting to add a wrapper around Francisco's algorithm
for the mini-review.
It was just a quick test (no time for cleverer algorithms), but I'm pretty
satisfied with the result and I'll keep these versions.

>How is the worst-case performance relative to Francisco's library?

The sort versions are O(n log n), like any merge sort. The quicksorts range
from O(n log n) to O(n^2), as expected from a quicksort.