[boost] [SORT] Parallel Algorithms


Francisco José Tapia

Dec 14, 2014, 1:03:17 PM
to Boost List
Sorry for the delay. As promised, here are the parallel sorting algorithms.

On my computer (quad core), the GCC std::sort and stable_sort are slightly
faster than my versions of intro_sort and merge_sort (stable sort is a
synonym of merge_sort). On another computer with an Intel i3, my algorithms
are slightly faster than the GCC versions.

While trying to optimize the parallel implementation, I modified the parallel
versions to use std::sort and std::stable_sort. The parallel_sort version
responds well, but parallel_stable_sort takes around 60% more time than the
version based on countertree::parallel_merge_sort.

After many tests, I discovered the cause of the problem. The demonstration is
in test_stress_sort_01.cpp.

In this program we have a big vector of 100 000 000 uint64_t elements. The
program splits it into parts of equal size, one per hardware core; on my
computer, 4 parts for 4 cores. It first sorts the parts sequentially,
measuring the time, and then does the same in parallel, with each thread
sorting one part. This is done with both std::stable_sort and
countertree::merge_sort, with the same input data for all. The next lines
show the results:

max. OpenMP threads: 4 (4 processors)

Sort of 100000000 elements in a vector ---------------------

Random elements, few repeated ( rand() )-----------------------

Stable Sort

Part 0 secs :2.77955

Part 1 secs :2.75015

Part 2 secs :2.76068

Part 3 secs :2.76284

Parallel Sort secs :8.49796

Countertree stable sort

Part 0 secs :2.72802

Part 1 secs :2.71614

Part 2 secs :2.72074

Part 3 secs :2.722

Parallel Sort secs :4.80692
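The structure of this stress test can be sketched as follows (my reconstruction for illustration, not the actual test_stress_sort_01.cpp code; the real test launches one thread per part and times both the sequential and the parallel runs):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split the vector into one part per hardware core and sort each part
// independently; in the real test each std::sort call below runs in its
// own thread. The parts are not merged afterwards, because the test only
// measures how the underlying sort scales across threads.
void sort_parts(std::vector<std::uint64_t>& data, unsigned nparts) {
    std::size_t chunk = data.size() / nparts;
    for (unsigned i = 0; i < nparts; ++i) {
        auto first = data.begin() + static_cast<std::ptrdiff_t>(i * chunk);
        auto last  = (i + 1 == nparts)
                   ? data.end()
                   : first + static_cast<std::ptrdiff_t>(chunk);
        std::sort(first, last);  // the benchmark runs this call per-thread
    }
}
```

Comparing the wall-clock time of the per-part sequential runs against the all-parts parallel run is what exposes the scalability difference shown above.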


Surprisingly, the parallel sort with countertree::stable_sort is much
faster than the sort with std::stable_sort in the GCC 4.8 and 4.9 compilers.
This is the secret of the speed of countertree::parallel_merge_sort.

This is very important: when the elements are small, the GCC and the
countertree algorithms have similar performance. But as the size of the
elements grows, std::stable_sort becomes slower than countertree::merge_sort,
and the difference increases in the parallel version. The next lines are from
benchmark_sort_05.cpp:


***************************************************************

Sort of N = 6250000 elements of size 128

*****************************************************************

Random elements, few repeated ( rand() )-----------------------

STL std::sort :2.9188 secs

countertree::intro_sort :2.87864 secs

parallel::sort :2.46113 secs

tbb::parallel_sort :1.86458 secs

countertree::parallel_sort:1.89263 secs


std::stable_sort :8.77963 secs

countertree::merge_sort :5.70694 secs

parallel::stable_sort :8.03946 secs

countertree::parallel_merge_sort :4.90815 secs


I have in mind a new version of the stable sort to improve the speed by
reducing the number of internal element copies. I hope to have time over
Christmas, but I can't guarantee it. If the results are as expected, I will
send it to you.


You also have a version of indirect_sort, which sorts pointers to the
elements and, at the end, rearranges the elements according to the pointers.
You can replace any sort algorithm with its indirect version and obtain the
same results; the only difference is the time needed. In
benchmark_indirect_sort.cpp you can see a comparison of the time needed by
several sort methods and their indirect versions, with elements of different
sizes.
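The idea behind indirect sorting can be sketched like this (an illustrative implementation, not the countertree code itself): sort an index array by the values it refers to, then apply the resulting permutation to the data, so each large element moves only O(1) times instead of O(log N):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Indirect sort sketch: only indices move during the sort; the (possibly
// large) elements are moved once at the end, cycle by cycle.
template <typename T, typename Compare = std::less<T>>
void indirect_sort(std::vector<T>& data, Compare comp = Compare{}) {
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return comp(data[a], data[b]); });
    // Apply the permutation in place, following each cycle once.
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (idx[i] == i) continue;
        T temp = std::move(data[i]);
        std::size_t j = i;
        while (idx[j] != i) {
            data[j] = std::move(data[idx[j]]);
            std::size_t next = idx[j];
            idx[j] = j;  // mark this position as settled
            j = next;
        }
        data[j] = std::move(temp);
        idx[j] = j;
    }
}
```

The trade-off is an extra index array of N words plus one extra level of indirection in every comparison, which is why the technique only pays off for big elements.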


This code is a preliminary version, done for my countertree library. If you
consider it useful, I will change the namespace and document the code, the
algorithms, and the ideas obtained from the benchmarks.

Please feel free to ask me any question, or to modify or change anything.
The goal is to provide good sort methods to the users.

In the zip file you have the code and a file with a description of each
source file.


Yours
Sort.7z

Steven Ross

Dec 14, 2014, 9:52:58 PM
to bo...@lists.boost.org
Francisco,

On Sun, Dec 14, 2014 at 1:03 PM, Francisco José Tapia <fjt...@gmail.com> wrote:
> Sorry by the delay, As promised, the sorting parallel algorithms
>
> On my computer (Quad Core) the GCC std::sort and stable_sort are slight
> faster than my version of intro_sort and merge_sort (stable sort is a
> synonym of merge_sort). In other computer with an Intel I3, my algorithms
> are slight faster than in the GCC version.

An i3 is low-end hardware. I recommend testing on an i7 or a faster system.

> This code is a preliminary version, done for my countertree library. If
> you consider useful, I will change the name space and document the code,
> the algorithms and the ideas obtained from the benchmarks.
>
> Please feel free for ask me any question or for to modify or change
> anything. The goal is to provide good sort methods to the users.
>
> In the zip file you have the code and a file with a description of each
> code.

First, thanks for sending this in a fairly easy to test format.
I suggest updating your docs to recommend single-line compilation (if
appropriate), which saves a step:
Sort/Modified/benchmark/GCC/util_sort$ g++ -std=c++11 -Wall
-fexceptions -fopenmp -fomit-frame-pointer -fexpensive-optimizations
-O3 -I../../../src benchmark_sort_03.cpp -fopenmp -s -lpthread -ltbb
-o benchmark_sort_03 -pthread

I was unable to run your indirect sort benchmark:
Sort/Modified/benchmark/GCC/util_sort$ g++ -std=c++11 -Wall
-fexceptions -fopenmp -fomit-frame-pointer -fexpensive-optimizations
-O3 -I../../../src -c benchmark_sort_indirect.cpp -o
benchmark_sort_indirect.o -pthread
Sort/Modified/benchmark/GCC/util_sort$ g++ -o benchmark_sort_indirect
benchmark_sort_indirect.o -pthread -fopenmp -s -lpthread -ltbb
Sort/Modified/benchmark/GCC/util_sort$ ./benchmark_sort_indirect
Size of element 8 Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)

I'm actually quite interested in the generalized indirect sort idea (I
looked at your two functions). Do you think you could package that up
as a simple wrapper, where the user passes in the sort function
they want to use, and it takes care of changing the input and
comparison (if possible) to be indirect, running the sort, and then
swapping into place based on the results of the indirect sort? If
that can be done without significant loss of efficiency, it would seem
quite useful (I've had to do indirect sorts multiple times, and it
always takes some code). Even your two functions look useful on their own.

I suggest using the boost random number generator (or if you stick
with the c++11 requirement, which I don't recommend, the std::
equivalent):
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int_distribution.hpp>
boost::random::mt19937 generator;
boost::random::uniform_int_distribution<int> distribution;
And for each random number:
int result = distribution(generator);
You may want to reuse and modify some of my tests in
https://github.com/spreadsort/sort/tree/master/example

rand() is returning lots of duplicate elements, as you're sorting more
than RAND_MAX elements.
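For instance, a 64-bit generator avoids the RAND_MAX ceiling entirely, since it produces values over the full 64-bit range. A minimal sketch (the helper name my_rand is illustrative, not from the library):

```cpp
#include <cstdint>
#include <random>

// A 64-bit Mersenne Twister produces ~2^64 distinct values, so duplicates
// stay rare even when generating 10^8 elements, unlike rand(), whose range
// is capped at RAND_MAX (often only 2^31 - 1).
std::uint64_t my_rand() {
    static std::mt19937_64 generator(12345);  // fixed seed: reproducible benchmarks
    return generator();
}
```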

I'm not sure why you're putting so much effort into competing with
stable_sort. Some people use it, but I've never seen it in actual
production code.
std::stable_sort is not just merge sort; based on the documentation,
it can use less memory than merge sort:
http://www.cplusplus.com/reference/algorithm/stable_sort/
Complexity:
If enough extra memory is available, linearithmic in the distance
between first and last: performs up to N*log2(N) element comparisons
(where N is this distance), and up to that many element moves.
Otherwise, polyloglinear in that distance: performs up to N*(log2(N))^2
element comparisons, and up to that many element swaps.

The speed results I get on my system for your library are below. It
just doesn't seem competitive with tbb for speed on my system. If you
get it there (within 5% on random, sorted, and mostly-sorted data on
both Windows and Linux), or if you find someone else with a genuine
need to use this in production code, I'll be interested to take
another look. This is compiled using g++ (Ubuntu 4.8.2-19ubuntu1)
4.8.2 and run on an Intel(R) Core(TM) i7-3612QM CPU @ 2.10GHz:

max. OpenMP threads: 8 (8 processors)
Sort of 200000000 elements in a vector -------------------

Sorted elements -----------------------------------------------
STL std::sort :3.60897 secs
parallel::sort :1.24468 secs
tbb::parallel_sort :0.059404 secs
countertree::parallel_sort :1.25563 secs

std::stable_sort :10.3108 secs
parallel::stable_sort :3.1866 secs
countertree::parallel_merge_sort :3.71677 secs

Reverse sorted elements ---------------------------------------
STL std::sort :2.57237 secs
parallel::sort :1.02421 secs
tbb::parallel_sort :0.895415 secs
countertree::parallel_sort :1.30754 secs

std::stable_sort :10.4286 secs
parallel::stable_sort :3.21695 secs
countertree::parallel_merge_sort :4.08913 secs

Random elements, few repeated ( rand() )-----------------------

STL std::sort :19.7451 secs
parallel::sort :3.98043 secs
tbb::parallel_sort :4.28722 secs
countertree::parallel_sort :6.85041 secs

std::stable_sort :18.6291 secs
parallel::stable_sort :4.08416 secs
countertree::parallel_merge_sort :6.09388 secs

Random elements, quite repeated ( rand() % (NELEM/2) )---------
STL std::sort :19.5069 secs
parallel::sort :3.93207 secs
tbb::parallel_sort :4.66967 secs
countertree::parallel_sort :6.29049 secs

std::stable_sort :18.8519 secs
parallel::stable_sort :4.31173 secs
countertree::parallel_merge_sort :6.1165 secs

Random element many repeated (rand() % 10000)--------------------
STL std::sort :11.8256 secs
parallel::sort :2.65534 secs
tbb::parallel_sort :3.44815 secs
countertree::parallel_sort :3.79712 secs

std::stable_sort :18.7368 secs
parallel::stable_sort :3.70629 secs
countertree::parallel_merge_sort :6.09851 secs

Equal elements --------------------------------------------------
STL std::sort :4.50343 secs
parallel::sort :1.44262 secs
tbb::parallel_sort :0.0588514 secs
countertree::parallel_sort :1.53466 secs

std::stable_sort :10.382 secs
parallel::stable_sort :3.22236 secs
countertree::parallel_merge_sort :4.17672 secs

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Francisco José Tapia

Dec 16, 2014, 1:42:57 PM
to Boost List
Hi Steven,

I thought all the programs were checked, but obviously not. Now in the
folder you have a shell script with the command to compile and link with
optimization (for example, make_benchmark_sort_03.sh), and a make_all.sh
which compiles everything. Open a terminal in the folder, then compile and
run.

All the programs use the HW threads, and use all that are available. All the
programs were written to run on a Linux machine with 4 GB of RAM. The number
of elements involved in the sort is defined at the beginning of each program
with the #define NELEM. You can change this value and compile again.

As you suggested, I changed the number generator and now use a Mersenne
Twister from the standard library.

The stable sort is less used than the non-stable sort, but sometimes it is
important. I developed it because my project has a proxy which receives
operations on elements identified by a key. The proxy sorts the operations
and sends them to the corresponding thread, and it is not the same to read
the value of a key and then delete that key, as to delete the key and then
try to read the value. I can't add a time mark to know which operation
arrived first.
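The requirement can be illustrated with a small example (mine, not the proxy code): a stable sort keeps operations on the same key in arrival order, so "read key, then delete key" is never reordered into "delete key, then read key".

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Operations arrive in vector order; after sorting by key, operations that
// share a key must keep that arrival order.
struct Operation {
    int key;
    std::string action;
};

void sort_by_key(std::vector<Operation>& ops) {
    // std::stable_sort preserves the relative order of equal keys;
    // std::sort gives no such guarantee.
    std::stable_sort(ops.begin(), ops.end(),
                     [](const Operation& a, const Operation& b) {
                         return a.key < b.key;
                     });
}
```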


In order to clarify the information obtained from the benchmarks:

In the folder Modified, the GCC parallel::sort and parallel::merge_sort, and
the countertree::parallel_sort and countertree::parallel_merge_sort, all use
std::sort and std::stable_sort. The utility of this folder is the comparison
with the same programs in the folder Original; the results show the problems
described in this message.

In the folder Original, the GCC parallel::sort and parallel::merge_sort use
std::sort and std::stable_sort, while countertree::parallel_sort uses
countertree::intro_sort and countertree::parallel_merge_sort uses
countertree::merge_sort.


SCALABILITY: Some algorithms don't scale well when the number of threads
grows; increasing the number of threads doesn't decrease the execution time.
This lack of scalability is emphasized when the size of the elements to
sort grows.

As said in the previous message, the stable sort needs 2.7 secs per part
with 1 thread, but sorting the 4 parts with 4 HW threads needs 8.49 secs,
around 3 times the single-thread time. countertree::merge_sort needs 2.7
secs per part with 1 thread, and with 4 HW threads needs 4.89 secs (see
[Original test_stress_sort.cpp]).

This shows that countertree::merge_sort is good for use in parallel software,
and GCC std::stable_sort is bad.

If you look at the results of countertree::parallel_merge_sort in the
benchmarks in the folder Modified, they are really bad compared with the
same results in the folder Original. The difference is the internal
algorithm: in the folder Modified it uses GCC std::stable_sort, and in
Original it uses countertree::merge_sort.

GCC std::sort and countertree::intro_sort have similar performance for all
element sizes. But in the parallel process, when the size of the elements
grows, the poor scalability appears. The next lines are from the Original
benchmark_sort_05.cpp on my computer:


With 128 bytes elements

Random elements, few repeated -----------------------

STL std::sort :2.9188 secs

countertree::intro_sort :2.87864 secs

parallel::sort :2.46113 secs

tbb::parallel_sort :1.86458 secs

countertree::parallel_sort:1.89263 secs


With 256 bytes elements

Random elements, few repeated ( rand() )-----------------------

STL std::sort :3.03143 secs

countertree::intro_sort :3.02733 secs

parallel::sort :2.23231 secs

tbb::parallel_sort :1.76186 secs

countertree::parallel_sort:1.56509 secs

As we can see, the parallel process is ineffective with GCC parallel::sort
with big objects.


About the merge_sort algorithm: it uses, in addition, half of the memory
used by the data, but if you don't use additional memory, the speed drops.
I didn't examine in depth the algorithm used in GCC; I only coded my own
algorithm. I will examine it, and I will try to reduce the memory used,
without loss of speed and scalability.
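The half-memory merge idea can be sketched as follows (my illustration under the assumption that only the first half is buffered; the countertree implementation may differ): to merge two sorted halves, it is enough to copy only the first half into an auxiliary buffer, so the extra memory is N/2 instead of N.

```cpp
#include <iterator>
#include <utility>
#include <vector>

// Merge the sorted halves [first, mid) and [mid, last) using only N/2
// auxiliary memory. The write position can never overtake the unread part
// of the right half, so merging back into the original range is safe.
template <typename Iter>
void merge_halves(Iter first, Iter mid, Iter last) {
    using T = typename std::iterator_traits<Iter>::value_type;
    std::vector<T> buffer(std::make_move_iterator(first),
                          std::make_move_iterator(mid));  // N/2 extra memory
    auto left = buffer.begin(), left_end = buffer.end();
    Iter right = mid, out = first;
    // Take from the right half only when strictly smaller, keeping stability.
    while (left != left_end && right != last)
        *out++ = std::move(*right < *left ? *right++ : *left++);
    std::move(left, left_end, out);  // leftover left-half elements
    // leftover right-half elements are already in their final positions
}
```

A full merge sort would apply this step bottom-up or recursively; the buffer never needs to grow beyond half the range being merged.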

About a server for testing: I don't have modern servers at my school; budget
cuts eliminated the new servers. I have an i7 Xeon, but it is a 3-year-old
machine with 4 HW threads, and in the benchmarks it is slower than the i3
and i5 machines assembled this year. If someone can check and provide us the
results, I will be grateful. Version 4.9.1 of GCC has better optimization
than 4.8.2. If you are using Ubuntu, version 14.10 has 4.9.1 as the default.

Yours


Francisco Tapia
Sort.7z

Steven Ross

Dec 18, 2014, 10:14:51 PM
to bo...@lists.boost.org
Francisco,

On Tue Dec 16 2014 at 1:42:59 PM Francisco José Tapia <fjt...@gmail.com>
wrote:

> Hi Steven,


>
> I thought all the program was checked, but obviously not. Now in the
> folder you have a shell script with the command for to compile and link
> with optimization ( by example make_benchmark_sort_03.sh), and a
> make_all.sh which compile all. Open a terminal in the folder, and compile
> and run
>

Great! Thanks. I ran everything, and the only failure was with
benchmark_sort_indirect:


./benchmark_sort_indirect
Size of element 8 Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)

>


> As suggested me, change the number generator and now use a Mersenne of the
> standard library.
>

Thanks.


> The stable sort is less used than the non stable sort, but sometimes is
> important. I developed because in my project have a proxy , which receives
> the operations over the elements defined by a key. The proxy sort, and send
> the operations to the corresponding thread, and it is not the same, read
> the value of a key and after delete that key, than delete the key and after
> try to read the value. I can't add a time mark for to know which arrived
> before
>

Fair enough.


> As say in the previous message, the stable sort, with 1 thread need 2.7,
> and with 4 HW threads need 8.49, around 3 times the time of 1 thread. And
> the countertree::merge_sort need 2.7 for to sort with 1 thread and with 4
> HW threads need 4.89, secs ( See the [Original test_stress_sort.cpp ] )
>
> This show countertree::merge_sort is a good for to use in a parallel SW
> and GCC std::stable_sort is bad.
>

The numbers I see are not as extreme, but your merge_sort looks good in
comparison, especially with large elements. I will investigate
alternatives to see if this is a truly speed-competitive offering.
Have you considered Timsort (standard in Java and Python)? Timsort is a
stable sort that is particularly good with sorted, mostly-sorted, and
reverse-sorted data. It might be worth offering as an alternative stable
sort for those concerned about mostly-sorted data.

Your intro_sort looks reasonable overall, but it varies from comparable
speed to 14% slower when run in parallel on random data, and is an order of
magnitude slower on already-sorted data. As tbb is free, I just don't see
a compelling case for your introsort implementation; tbb is a very hard
competitor to beat.

> The GCC std::sort and countertree::intro_sort have similar performance
> with all the size of elements. But in the parallel process, when the size
> of the elements grow, the poor scalability appear. The next lines are are
> from Original benchmark_sort_05.cpp on my computer
>
>
> With 128 bytes elements
>
> Random elements, few repeated -----------------------
>
> STL std::sort :2.9188 secs
>
> countertree::intro_sort :2.87864 secs
>
> parallel::sort :2.46113 secs
>
> tbb::parallel_sort :1.86458 secs
>
> countertree::parallel_sort:1.89263 secs
>
>
> With 256 bytes elements
>
> Random elements, few repeated ( rand() )-----------------------
>
> STL std::sort :3.03143 secs
>
> countertree::intro_sort :3.02733 secs
>
> parallel::sort :2.23231 secs
>
> tbb::parallel_sort :1.76186 secs
>
> countertree::parallel_sort:1.56509 secs
>
> As we can see the parallel process is ineffective with GCC parallel::sort
> with big objects.
>

Those don't match up with my results for 256-byte elements:
STL std::sort :1.04718 secs
countertree::intro_sort :1.14214 secs
parallel::sort :0.679178 secs
tbb::parallel_sort :0.600001 secs
countertree::parallel_sort:0.627144 secs

And this is what I see for size 64:
STL std::sort :1.6577 secs
countertree::intro_sort :1.73023 secs
parallel::sort :0.963199 secs
tbb::parallel_sort :0.873805 secs
countertree::parallel_sort:0.997228 secs

I'm not sure the benefits you are seeing are portable.


>
> About the merge_sort algorithm. It use additionally half of the memory
> used by the data. But if you don't use additional memory, the speed drops.
> I didn't examine in deep the Algorithm used in GCC, I only codified my own
> algorithm. I will examine, and I will try to reduce the memory used,
> without lost of speed and scalability.
>

The standard algorithm says it will lose speed when it reduces memory used;
it'd be a nice feature to have, but if we can repeatably see 2X relative
speedups without that feature, I think some people may find it useful.


>
> About the server for to test. I don't have modern servers in my school.
> The budget cut eliminate the new servers. I have an I7 Xeon , but is a 3
> years old machine with 4 HW threads, and in the benchmarks is slower than
> the I3 and I5 machined mounted this year. If someone can check and provide
> us the result, I will be grateful. The version 4.9.1 of GCC have better
> optimization than the 4.8.2. If you are using Ubuntu, the version 14.10
> have the 4.9.1 version as predefined.
>

A boost library should still be useful for people at least one compiler
version back; I feel fully justified testing on 4.8.2. Have you tested on
Windows yet?

Steven Ross

Dec 18, 2014, 10:22:17 PM
to bo...@lists.boost.org

Steven Ross

Dec 22, 2014, 6:09:07 AM
to bo...@lists.boost.org
Francisco,

On Thu, Dec 18, 2014 at 10:22 PM, Steven Ross <sprea...@gmail.com> wrote:
> Francisco,
>
> Have you looked at:
> https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp
>
> For comparison?

I used the free tbb parallel stable sort referenced above for
comparison, and found that as soon as I randomized the contents of the
entire struct and made the copy constructor copy the entire struct,
the tbb version was 37% faster on randomized data, all the way up to
256 bytes, relative to countertree::parallel_merge_sort.
I've attached my modified "original" directory where I tested this
out (see Original/benchmark/parallel_stable_sort/build/make_tbb_benchmark.sh).
Unless you can get your library close to the speed of this tbb
sort, I don't see how we'd be benefiting people by pointing to it
instead of the tbb library.

What I am interested in is your idea for indirect sorting: can you
come up with an easy-to-use API to handle efficient indirect sorting?
That would probably be worth including in the boost::sort library,
especially if it is compatible with different sort functions.
original.tar.gz

Francisco José Tapia

Dec 22, 2014, 4:56:28 PM
to Boost List
Hi Steven,

I have been working on the memory used by the different algorithms,
especially the stable algorithms. I developed a low-memory algorithm,
circular_stable_sort, which uses only 10% more memory; the speed decreases
around 15% with small objects, and with big objects it is similar to
merge_sort.

I was also looking at the Timsort algorithm. I found a C++ implementation at
https://github.com/gfx/cpp-TimSort .

I inserted this algorithm into the benchmark programs. With small objects
the speed is slower than the stable sort and merge_sort, and it has good
performance with big objects, greater than 64 bits. The decision of Java and
Python is logical, because they must order objects, usually non-contiguous
and bigger than a naked number.

To measure the memory used by the programs I use the command
/usr/bin/time -v program_to_execute. I take benchmark_sort_04
(100 000 000 uint64_t numbers), comment out all the invocations of the
algorithms except one, compile, and run it in a terminal with the command
/usr/bin/time -v benchmark_sort_04.

I repeat this process with all the algorithms to learn the memory used
by each algorithm.


Algorithm                             Memory used

std::sort                             1 565 456
countertree::intro_sort               1 565 280
GCC parallel::sort                    2 346 868
tbb::parallel_sort                    1 565 696
countertree::parallel_sort            1 565 468
std::stable_sort                      1 957 232
countertree::merge_sort               1 955 908
countertree::circular_merge_sort      1 696 316
sfx::timsort                          1 996 364
parallel::stable_sort                 2 742 384
countertree::parallel_merge_sort      1 956 636



I checked the times with benchmark_sort_05, with the operator= and the copy
constructor copying the whole structure. The number of elements in each sort
is 800000000 / (size of the element in bytes).




Times in seconds; columns are element sizes in bytes:

                                    8       16      32      48      64     128     256

std::stable_sort                  16.99   10.36    9.75    9.62    8.85    8.89   10.25
countertree::merge_sort           16.96   10.38    8.13    7.44    7.10    6.67    9.01
countertree::circular_merge_sort  19.77   12.80    8.81    7.61    7.01    6.48    8.62
timsort                           22.89   13.43   10.12    8.33    7.70    7.00    8.62
parallel::stable_sort              9.68    8.91    8.90    8.90    8.23    8.21    7.72
countertree::parallel_merge_sort   7.04    5.96    5.74    5.70    5.64    5.54    5.94


About the stable sort of tbb: I saw it, but I didn't have time to examine
and check it in depth. But don't worry; benchmark_sort_indirect shows:

Size of element 48 Number of elements :16666666

parallel_sort :2.53964 sec

indirect_parallel_sort :5.70979 sec

parallel_stable_sort :5.73186 sec

indirect_parallel_stable_sort :5.59269 sec

Size of element 64 Number of elements :12500000

parallel_sort :2.59812 sec

indirect_parallel_sort :4.1553 sec

parallel_stable_sort :5.63771 sec

indirect_parallel_stable_sort :3.95266 sec

Size of element 128 Number of elements :6250000

parallel_sort :2.28648 sec

indirect_parallel_sort :2.34792 sec

parallel_stable_sort :5.62879 sec

indirect_parallel_stable_sort :2.30188 sec

Size of element 256 Number of elements :3125000

parallel_sort :2.05017 sec

indirect_parallel_sort :1.50661 sec

parallel_stable_sort :5.85409 sec

indirect_parallel_stable_sort :1.48561 sec

As you can see, with 256-byte elements the indirect parallel_stable_sort
is 3 times faster, and the indirect parallel_sort is 25% faster.

The final solution, if one is adopted in the end, will be a hybrid algorithm
depending on the size of the elements.
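Such a hybrid dispatch might look like this (a sketch; the threshold value and the gather step are my assumptions, not a decided design):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Size-based hybrid: small elements are sorted directly; big elements are
// sorted through an index array and then gathered, so each big object is
// moved only once instead of O(log N) times.
template <typename T>
void hybrid_sort(std::vector<T>& data) {
    constexpr std::size_t kThreshold = 64;  // assumed cutoff; benchmarks would tune it
    if (sizeof(T) <= kThreshold) {
        std::sort(data.begin(), data.end());
        return;
    }
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return data[a] < data[b]; });
    std::vector<T> sorted;
    sorted.reserve(data.size());
    for (std::size_t i : idx) sorted.push_back(std::move(data[i]));
    data = std::move(sorted);
}
```

The benchmarks over different machines would decide both the cutoff and which direct and indirect algorithms sit on each side of it.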


I propose the following: give me 2 or 3 weeks and I will prepare a big
benchmark with all the algorithms, including the indirect versions of the
stable and unstable sorts. We will run the benchmarks on different machines
and, according to the results, take a decision about whether we propose
something to Boost, and which algorithm must be used in the hybrid version.

If, in the end, we decide not to propose anything, I can only say this has
been a nice and exciting experience.

My intention is not to impose my code; I want to provide the best solution,
if it exists, to the final users, with my code or with the code of others.
I will be the first to applaud the winner.

We are doing this because this parallel implementation doesn't need any
other library, only the standard one. There are compilers without OpenMP or
TBB support, and there are users who simply don't want to load those
libraries just to do a sort.


I have in mind several improvements to the algorithms. I will code and test
them, and if they run well, I will include them in the algorithms. I also
want to simplify the interface of indirect_sort, so that it can be used in
an easy way with any sort algorithm. If you are interested, I will also
prepare a brief document with the explanation and the internal algorithm.

To finish, I just want to wish a Happy Christmas and New Year to the Boost
community.

Yours


Francisco Tapia

fjt...@gmail.com

Francisco José Tapia

Dec 22, 2014, 5:04:17 PM
to Boost List
I forgot to mention the Windows version of the benchmark. I suppose the
results will be similar to those obtained in Linux. I will prepare it, too.

With all the information, we can decide on the best solution, if one exists.

Thanks

Francisco Tapia

Steven Ross

Dec 22, 2014, 10:34:29 PM
to bo...@lists.boost.org
Francisco,

On Mon Dec 22 2014 at 5:04:17 PM Francisco José Tapia <fjt...@gmail.com>
wrote:

> > I had been working about the memory used by the different algorithms,


> > specially the stable algorithms. I developed a low memory algorithm
> > circular_stable_sort, which use only a 10% more of memory, and the speed
> > decrease around 15% with small objects and with big objects is similar to
> > the merge_sort
>

Some people may find that useful. What is the worst-case computational
complexity? Is it a well-known algorithm? The bar is going to be higher
(it may require a full review) for previously unknown algorithms.


> > I was looking too, the TimSort algorithm. I found a C++ implementation
> > in https://github.com/gfx/cpp-TimSort .
> >
> > I inserted this algorithm in the benchmark programs. With small objects,
> > the speed is slower than the stable sort and merge_sort, and have a good
> > performance with big objects, greater than 64 bits. It's logic the
> decision
> > of Java and Python, because they must order objects, usually non
> contiguous
> > and bigger than the naked number.
> >
> > For the measure of the memory used by the program I use the command
> > /usr/bin/time -v program_to_execute. I take the benchmark_algorithm_04
> > (100000000 uint64_t numbers), and comment all the invocations to the
> > algorithms , except one, compile, and run in a terminal with the command
> > /usr/bin/time -v benchmark_sort_04.
> >
> > I repeat this process with all the algorithms for to know the memory
> > used by each algorithm.
>

I suggest changing the struct you use for benchmarking to this:
template <uint32_t NN>
struct reburcio
{
    uint64_t M[NN];

    reburcio() {
        init();
    }

    void init() {
        for (uint32_t i = 0; i < NN; ++i) {
            M[i] = my_rand();
        }
    }

    bool operator < (const reburcio &R) const {
        for (uint32_t i = 0; i < NN; ++i) {
            if (M[i] != R.M[i]) {
                return M[i] < R.M[i];
            }
        }
        return false;
    }
};

That way it uses the default copy and assignment operators, which copy
the entire data structure, and it initializes and compares all elements
of the array. It appears that the benchmark you sent me only copies over
and compares the first element of the array, which might have caused some
of the large-element differences in performance you were seeing.

With this data structure, countertree intro_sort is competitive on my linux
system with std::sort on fully random data, and countertree parallel_sort
is competitive for 8, 16, and 256 bytes per element, but significantly
underperforms tbb (by 10-15%) for 32 and 64 bytes.


> > As you can see with a 256 bytes of size the indirect
> > parallel_stable_sort is 3 times faster. And the indirect_sort is a 25%
> > faster.
>

Yes, indirect sort should eventually be faster, given a large enough
element, especially if indirect sort is done with the fastest underlying
algorithm available.


> >
> > The final solution, if at end is taken, will be a hybrid algorithm
> > depending of the size of the elements
>

That sounds reasonable for fixed-size elements. What about variable-length
elements, like strings? Will this only be for fixed-size elements?


> > I propose you the next. Let me 2 or 3 weeks and I will prepare a big
> > benchmark with all the algorithms, including the indirect version of the
> > stable and unstable sort. We will run the benchmarks over different
> > machines, and according the results, take a decision about if we propose
> > something to Boost, and which algorithm must be used in the hybrid
> version.
> >
> > If, at end, we decide don't propose nothing, I only can say this had
> > been a nice and exciting experience
>

I think indirect sorting is promising.


> >
> > Mi intention is not to impose my code, I want provide to the final users
> > the best solution, if exist. With my code or with the code of other. I am
> > the first to clap the winner.
> >
> > We are doing this because this parallel implementation don't need any
> > other library, only the standard. There are compilers without OpenMP or
> TBB
> > support, or simply, the user don't want to load them for to make a sort.
>

Agreed, but OpenMP is very common (even Android supports it!), so systems
without OpenMP are a very small niche in terms of total systems out there.
TBB is supported on intel-compatible processors, and the vast majority of
deployed processors where parallel sorting makes sense (i.e., won't drain
the battery really quickly) are intel-compatible. If we're going to provide
an alternative, it should cost very little in terms of performance, and
ideally provide some kind of benefit besides eliminating a dependency.


> > I have in mind, several improvements of the algorithms. I will codify
> > and test, and if run well, I include in the algorithms. I want, too,
> > simplify the interface of indirect_sort, for to be used in a easy way by
> > any sort algorithm. If you are interested, I will prepare too a brief
> > document with the explanation and the internal algorithm.
>

I'd like to see that document.
Along with testing a struct that is an array of ints, you should probably
test variable-length string sorting, as string sorting is very common.

Merry Christmas,

Steve

Francisco José Tapia

Dec 23, 2014, 3:31:33 AM
to Boost List
We can also do a benchmark with variable-length elements using strings.

If you prepare the operations to perform in that test, I can prepare an
additional benchmark with all the algorithms and your operations.

After that, we can look at the results and take a decision.

Yours

Francisco

Steven Ross

Dec 23, 2014, 6:49:37 AM
to bo...@lists.boost.org
Francisco,

On Tue Dec 23 2014 at 3:31:32 AM Francisco José Tapia <fjt...@gmail.com>
wrote:

> We can also do a benchmark with variable-length elements using strings.


>
> If you prepare the operations to perform in that test, I can prepare an
> additional benchmark with all the algorithms and your operations.
>
> After that, we can look at the results and take a decision.
>
> Yours
>
> Francisco
>

The approach I've used with the boost::sort library is to set up the
benchmark tests so that they read in a file of random data generated by a
single application (randomgen), or any other file that is passed in for
testing, and they write out their results to file. This enables better
testing and debugging in case there are any problems (I directly diff the
sort results when I can expect them to be identical; tune.pl does this
automatically).
I would prefer if you used stringsample.cpp and int64.cpp as examples of
how to do this, and downloaded the boost sort library and added any
modifications you need into that code. The library has a tune.pl script
that runs the benchmarks. It may be helpful to write a utility
templated function to switch between different algorithms, instead of the
current approach of using a "-std" switch to use std::sort.
https://github.com/spreadsort/sort
There is a README there on how to install it using modular boost.

Francisco José Tapia

Dec 23, 2014, 6:00:11 PM
to Boost List
Hi Steven

Thanks for your ideas. I will check them.

Now I am designing, developing, and testing several improvements to the
algorithms. When finished, I will prepare a clear and easy interface for
transforming any sort algorithm, even the parallel ones, into an indirect
sort, and prepare a brief document explaining how to do it.

Perhaps it will be a good moment to move all the code into the namespace
boost::sort.

Only after these things will I begin to prepare the big benchmark. Then I
would like to draw on your experience and opinion.

My idea is to make the benchmark easy to run; if we want to know the
opinions and results of other members of the Boost community, we must try
to simplify their work.

I will report my progress, but I must travel to be with my family, so until
January I won't be running at full speed.

Merry Christmas. My best wishes to you and the Boost community


Francisco

Francisco José Tapia

Jan 13, 2015, 12:43:01 PM
to Boost List
Hi Steven,

Over Christmas I worked hard. I developed a new version of the merge sort
that is about 10-15% faster than the previous one, using additional memory
of only half the size of the data. The GCC and TBB algorithms use
additional memory equal to the size of the data.

This version is exception safe. If an exception occurs in the algorithm
(exceptions can only be thrown in the allocation and deallocation of the
additional memory), the integrity of the data is guaranteed: you keep all
the original data (unordered if the exception occurs during allocation,
ordered if during deallocation).

I examined the TBB version, and its secret is the sample sort algorithm.
The TBB version is provisional and is not exception safe. I am working on
the design of a new, exception-safe version of this algorithm. The current
version of my parallel stable algorithm is about 20% slower than the TBB
version in the worst case, but when the size of the elements grows, the
indirect version is faster than all the others: with 64-byte elements it is
around 10% faster than the best TBB version, with 128 bytes the best TBB
version is around 90% slower, and with 256 bytes it is 240% slower.

An additional advantage of the indirect sort is that it uses extra memory
only for the pointers, not for the data, and this extra memory is around
10% of that used by the data.

I hope to have the new algorithm in two weeks, more or less. Then I will
put everything in the namespace boost::sort, and design all the tests and
benchmarks, as mentioned in the previous message.


Yours


Francisco

Steven Ross

Jan 13, 2015, 9:52:59 PM
to bo...@lists.boost.org
Francisco,

On Tue Jan 13 2015 at 12:43:00 PM Francisco José Tapia <fjt...@gmail.com>
wrote:

> Hi Steven,


>
> Over Christmas I worked hard. I developed a new version of the merge sort
> that is about 10-15% faster than the previous one, using additional memory
> of only half the size of the data. The GCC and TBB algorithms use
> additional memory equal to the size of the data.
>


>
> This version is exception safe. If an exception occurs in the algorithm
> (exceptions can only be thrown in the allocation and deallocation of the
> additional memory), the integrity of the data is guaranteed: you keep all
> the original data (unordered if the exception occurs during allocation,
> ordered if during deallocation).
>

> Less memory, comparable speed, and exception safety sounds good.


> I examined the TBB version, and its secret is the sample sort algorithm.
> The TBB version is provisional and is not exception safe. I am working on
> the design of a new, exception-safe version of this algorithm. The current
> version of my parallel stable algorithm is about 20% slower than the TBB
> version in the worst case, but when the size of the elements grows, the
> indirect version is faster than all the others: with 64-byte elements it
> is around 10% faster than the best TBB version, with 128 bytes the best
> TBB version is around 90% slower, and with 256 bytes it is 240% slower.
>

> Yes, indirect sorting is clearly superior for large data types, and it
would be good to make it as easy as possible to use. It'd be nice to also
have a simple wrapper that can use the spreadsort algorithm with
indirect sorting too.

I am in the process of integrating the first version of the boost sort
library in https://github.com/boostorg/sort. It would be good to integrate
your testing with the tests in that library.

Thorsten Ottosen

Jan 14, 2015, 4:49:32 AM
to bo...@lists.boost.org
On 14-01-2015 03:52, Steven Ross wrote:

>> I examined the TBB version, and its secret is the sample sort algorithm.
>> The TBB version is provisional and is not exception safe. I am working on
>> the design of a new, exception-safe version of this algorithm. The current
>> version of my parallel stable algorithm is about 20% slower than the TBB
>> version in the worst case, but when the size of the elements grows, the
>> indirect version is faster than all the others: with 64-byte elements it
>> is around 10% faster than the best TBB version, with 128 bytes the best
>> TBB version is around 90% slower, and with 256 bytes it is 240% slower.
>>
>> Yes, indirect sorting is clearly superior for large data types, and it
> would be good to make it as easy as possible to use. It'd be nice to also
> have a simple wrapper that can use the spreadsort algorithm with
> indirect sorting too.

Larger, non-fast movable data types?

-Thorsten

Paul A. Bristow

Jan 14, 2015, 6:59:29 AM
to bo...@lists.boost.org
> -----Original Message-----
> From: Boost [mailto:boost-...@lists.boost.org] On Behalf Of Steven Ross
> Sent: 14 January 2015 02:53
> To: bo...@lists.boost.org
> Subject: Re: [boost] [SORT] Parallel Algorithms
>
> Francisco,
>
> On Tue Jan 13 2015 at 12:43:00 PM Francisco José Tapia <fjt...@gmail.com>
> wrote:
>
> > Hi Steven,
> >
> > Over Christmas I worked hard. I developed a new version of the merge
> > sort that is about 10-15% faster than the previous one, using additional
> > memory of only half the size of the data. The GCC and TBB algorithms use
> > additional memory equal to the size of the data.

<snip>

> > Yes, indirect sorting is clearly superior for large data types, and it
> would be good to make it as easy as possible to use. It'd be nice to also have a
> simple wrapper that can use the spreadsort algorithm with indirect sorting too.

Would Boost.Multiprecision (50 decimal digits floating-point say) provide a simple demo type?

> I am in the process of integrating the first version of the boost sort library in
> https://github.com/boostorg/sort. It would be good to integrate your testing with
> the tests in that library.

Perhaps a git branch off the new /sort/develop branch would be useful?

I note that Steven Ross is making a separate namespace for spreadsort

boost::sort::spreadsort

Although this seems long (slow typists can provide an alias or a using ... ;-)

I strongly suggest that you should follow this example, perhaps

boost::sort::tbb

though I don't like tbb at all - far too acronymic? Think of a better name, even if longer?

boost::sort::parallel?

Paul

PS The new Sort/Spreadsort docs are almost ready for release.

I will add Quickbook documentation (from plain text notes) for your parallel sort version if you write to me off-list.

---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830

Francisco José Tapia

Jan 14, 2015, 11:26:59 AM
to Boost List
Hi Thorsten,

The idea behind the indirect sort is very simple. Instead of sorting the
elements, you create a vector of pointers and sort that vector; this way
you move the pointers, not the elements. At the end, with the vector of
pointers sorted, you move the data only once.

In stable sorting, the algorithms usually need additional memory: GCC and
TBB need the same amount as the data, my algorithms only half. Imagine you
want to sort 10 million elements of 128 bytes each. With the indirect sort
you need 10 million pointers, plus additional memory for 5 million
pointers.

The memory used by the data is 1280 MB, but the pointers plus the
additional memory amount to 15 million × 8 bytes = 120 MB, around 10% of
the memory used by the data.

In the first messages of this conversation I sent a zip with code; in it
you can find the code of the indirect sort. If you don't understand
something, please tell me and I will try to explain.


Paul,

about the namespace: if, in the end, the results are satisfactory and we
decide to include the code in the library, just tell me the namespace and I
will use it. To me it's indifferent. But I need time for the new algorithm,
and to design a complete test suite for the code, the benchmarks for
comparing with other versions, and the documentation.

If you want a benchmark with multiprecision numbers, we can do it without
problem, even checking the speed with different sizes of the numbers.

Yours


Francisco

Francisco José Tapia

Mar 20, 2015, 9:12:08 AM
to Boost List
Hi Steven

I have the final version of all the algorithms. I wrote 6 stable sort
algorithms and selected the fastest. On my computer it is about 15% faster
than the GCC stable sort.

I also had problems with an error in an algorithm combined with an error in
the GCC dynamic memory system. Under a certain sequence of requests, the
system provided an invalid address, causing a segmentation fault. Debugging
to detect and correct it was a nightmare.

The algorithms provided are :

1.- sort : The algorithm is intro_sort. It doesn't need additional memory.
Performance similar to the GCC and Windows versions. Not stable.

2.- parallel_sort : Decomposes the global range into small ranges to be
sorted with intro_sort. Doesn't need additional memory. Performance similar
to the TBB and Microsoft versions. Not stable.

3.- stable_sort : The algorithm is called smart_merge_sort and is about
10-15% faster than the GCC stable sort version. The additional memory is
half of the memory used by the data (the GCC stable sort uses the same
amount as the data).

4.- sample_sort : A parallel stable sort. It has the same performance as
the TBB implementation, and is much faster than the GCC parallel
stable_sort. The additional memory is the same as the data.

5.- parallel_stable_sort : This algorithm is based on the sample sort. It
increases the sample_sort time by about 5%, but the additional memory used
is half of the memory used by the data.

The functions are in the namespace boost::sort. All the algorithms are
exception safe, and use indirect sort when the size of the objects is big.
According to my benchmarks, the size threshold for switching to indirect
sort is 128 bytes.

The function signature of the parallel algorithms is

algorithm ( first, last, compare , number_threads).

The number of threads is passed with a struct NThreads, which receives an
unsigned and returns an unsigned. The default constructor of this struct is
filled with the number of HW threads.

Today I am testing on Windows with Visual Studio 2013 CTP 4, and changing
some parts of the code due to incompatibilities with C++11 features.

In a few days I will document all the functions in the code to generate
Doxygen documentation, and I can send you a version of the code.

Then we must talk about how to design the benchmark. I think the starting
point can be your experience with your code.

The C++ committee proposes a format for calling the parallel functions,
described in

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf

It would be a good idea to examine it and decide whether we follow the
proposal, don't follow it, or provide both interfaces for the parallel
algorithms. From my first impression, it's not very complex to implement.

Yours


Francisco Tapia

Steven Ross

Mar 23, 2015, 6:46:27 AM
to bo...@lists.boost.org
On Fri, Mar 20, 2015 at 9:12 AM Francisco José Tapia <fjt...@gmail.com>
wrote:
>

> I have the final version of all the algorithms. I wrote 6 stable sort
> algorithms and selected the fastest. On my computer it is about 15% faster
> than the GCC stable sort.
>

Could you send them please? I'd like to verify the speedup you're seeing,
and compare vs TBB. I'm ok if it just is comparable in speed to TBB and it
has additional advantages of some type.


>
> I also had problems with an error in an algorithm combined with an error
> in the GCC dynamic memory system. Under a certain sequence of requests,
> the system provided an invalid address, causing a segmentation fault.
> Debugging to detect and correct it was a nightmare.
>

One thing to watch out for is undefined operations, such as type
conversions between int and float types; those aren't compiler bugs:
they're undefined (this problem hit me with float_sort, before someone
showed me how to convert safely). I hope you've fixed the bug.


> The algorithms provided are :
>
> 1.- sort : The algorithm is intro_sort. It doesn't need additional memory.
> Performance similar to the GCC and Windows versions. Not stable.


> 2.- parallel_sort : Decomposes the global range into small ranges to be
> sorted with intro_sort. Doesn't need additional memory. Performance
> similar to the TBB and Microsoft versions. Not stable.
>

Does it provide any advantage over those implementations, aside from being
open-source?


>
> 3.- stable_sort : The algorithm is called smart_merge_sort and is about
> 10-15% faster than the GCC stable sort version. The additional memory is
> half of the memory used by the data (the GCC stable sort uses the same
> amount as the data).
>

GCC uses no additional memory, or 100% additional memory?
Did you compare against the tbb stable sort I pointed you to?


>
> 4.- sample_sort : A parallel stable sort. It has the same performance as
> the TBB implementation, and is much faster than the GCC parallel
> stable_sort. The additional memory is the same as the data.
>
> 5.- parallel_stable_sort : This algorithm is based on the sample sort. It
> increases the sample_sort time by about 5%, but the additional memory used
> is half of the memory used by the data.
>

That's interesting, so it's better from a memory perspective than the
competing algorithms, and not bad in speed? That sounds like it's worth
serious consideration as an addition.


> The functions are in the namespace boost::sort. All the algorithms are
> exception safe, and use indirect sort when the size of the objects is big.
> According to my benchmarks, the size threshold for switching to indirect
> sort is 128 bytes.
>
> The function signature of the parallel algorithms is
>
> algorithm ( first, last, compare , number_threads).
>
> The number of threads is passed with a struct NThreads, which receives an
> unsigned and returns an unsigned. The default constructor of this struct
> is filled with the number of HW threads.
>
> Today I am testing on Windows with Visual Studio 2013 CTP 4, and changing
> some parts of the code due to incompatibilities with C++11 features.
>
> In a few days I will document all the functions in the code to generate
> Doxygen documentation, and I can send you a version of the code.
>
> Then we must talk about how to design the benchmark. I think the starting
> point can be your experience with your code.
>

Please look at this code, and refactor/reuse what test code you can:
https://github.com/boostorg/sort
Specifically, look in the test directory, the example directory, and look
in tune.pl. Are there any parameters you would want to tune for specific
systems (besides the number of threads)?
The basic idea is that randomgen writes out a randomized distribution to
disk, and then we run whatever algorithms we want to compare (writing their
results to disk), verify that the results are correct, then compare speed.
I'd prefer not to skip the disk intermediate step because having the files
on disk makes it easier to debug when something gives incorrect results.
We should compare with variable-length strings and ints, in addition to
more complicated data structures like what you've been testing with.
It's worth noting that it's quite common to use vectors for large
quantities of data in a class/struct, in which case the sorting would
already be indirect.


>
> The C++ committee proposes a format for calling the parallel functions,
> described in
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
>
> It would be a good idea to examine it and decide whether we follow the
> proposal, don't follow it, or provide both interfaces for the parallel
> algorithms. From my first impression, it's not very complex to implement.
>

With your implementations, it looks like people have these choices:
1) how many threads to use
2) stable or not
3) memory or speed efficient
4) direct or indirect

Note that #4 may be influenced by #3, and that you may be able to determine
that automatically, but providing people a simple way to make those choices
is a good idea. You could use the approach suggested by the committee if
it fits well, or another one if you have a better approach to this specific
problem.

Steve

Francisco José Tapia

Mar 24, 2015, 4:56:30 PM
to Boost List
Hi,

At the link below, in my Dropbox space, you have the code, the test
programs, and the benchmarks with GCC.

https://dl.dropboxusercontent.com/u/8437476/works/sort/Sort_V0.1.7z


*FILE DESCRIPTION*

In the root folder of the archive you have an INSTRUCTIONS.txt file
detailing the contents of all the folders, information on how to compile
the programs, where each file is, and which options and libraries you must
link.

*Test*

Contains the test programs for each class and algorithm.

*Src*

Contains all the source code. In the folder *src/boost/sort* there is a
file *algorithm.hpp*, which is the only include file needed to invoke the
algorithms.

The parallel algorithms have a parameter of the NThread type which
indicates the number of threads to use. By default it uses the same number
as the HW threads of the machine. It's a very simple structure defined in
boost/sort/util/atomic.hpp.

*Benchmarks*

This folder contains the benchmark code for the GCC compiler under Linux
x64. In the future I will add benchmarks with other compilers and operating
systems.

The folders *cpp-TimSort-master* and *parallel_stable_sort* have the C++
code of the TimSort algorithm, found in
https://travis-ci.org/gfx/cpp-TimSort, and the code of the sample sort with
TBB.

The code has been modified because the four versions use the same namespace
and many common names, so it was not possible to invoke the four versions
in the same program. The modification puts each version in a different
namespace.

In the folder *GCC*, you can find 3 folders:

- *algorithm*: contains the code of the benchmark programs.

- *bin*: contains the compiled, statically linked versions of the benchmark
programs. In this folder there is a shell script run.sh which runs the
programs sequentially and collects the results in .txt files with the same
name as each program.

- *Results*: in this folder you can find the results generated by the
benchmark programs on different machines. Each machine has a folder
containing text files (named after the programs) with the results, and a
description.txt file describing the HW on which the benchmarks ran. The
format used in this folder is the one suggested for presenting the results
from all the machines used in the test. Here you have the results from my
two computers, which give a first idea of the results to expect on other
machines.


The benchmarks, as in the old version, have a variant with 64-bit numbers
and a variant with objects of several sizes.

About the memory used by the algorithms: if you consider M the memory used
by the data, the total memory used by each algorithm is:


- GCC sort → M
- boost sort → M
- GCC parallel_sort → M
- boost parallel_sort → M
- TBB parallel_sort → M

- GCC stable_sort → 2 M
- boost stable_sort → 1.5 M
- TimSort → 2 M
- GCC parallel_stable_sort → 2.5 M
- TBB parallel_sort → 2 M
- boost sample_sort → 2 M
- boost parallel_stable_sort → 1.5 M

The algorithms use the indirect version when the size of the objects is
greater than 128 bytes.


I am speaking with several friends about running the benchmarks on more
powerful machines.

I think the first thing is to check the speed. If the speed is good, we
have something useful; if not, we must decide what to do with this, whether
anything is useful or we must drop it all.

If we decide to continue, the next steps are:

Reorganize the benchmark and include benchmarks with string objects. Your
idea about the use of files is, I think, the correct way to follow.

Finish all the documentation of the code to generate Doxygen documentation.

Generate the documentation web pages for the Boost project.

Modify the test programs to work with the bjam files.

If you find any problem, error, or doubt, please contact me to resolve it.


Francisco

Steven Ross

Mar 24, 2015, 10:32:45 PM
to bo...@lists.boost.org
Francisco,

Thanks for your detailed explanation.

I'll repeat what I've said before about your object benchmark:


I suggest changing the struct you use for benchmarking to this:
template <uint32_t NN>
struct reburcio
{
    uint64_t M[NN];

    reburcio() {
        init();
    }

    void init() {
        for (uint32_t i = 0; i < NN; ++i) {
            M[i] = my_rand(); // or something with less variability to
                              // test identical-substring performance
        }
    }

    bool operator < (const reburcio &R) const {
        for (uint32_t i = 0; i < NN; ++i) {
            if (M[i] != R.M[i]) {
                return M[i] < R.M[i];
            }
        }
        return false;
    }
};

That way it's using the default copy and assignment operators, which copy
over the entire data structure, and it initializes and compares all elements
of the array. It appears that the benchmark you sent me only copies over
and compares the first element in the array, which might have caused some
of the large-element differences in performance you were seeing.

Another alternative is just using a vector for the array of elements.

On Tue, Mar 24, 2015 at 4:56 PM Francisco José Tapia <fjt...@gmail.com>
wrote:

>


> - GCC sort → M
> - boost sort → M
>

I don't think we'll ship this; this looks more like a proof-of-concept,
unless the indirect sorting benefits are demonstrable after fixing the
object you use in testing as described above.

> - GCC parallel_sort → M
> - boost parallel_sort → M
>

I see no advantage to this sort relative to tbb, and its performance on
already-sorted data needs to improve to be competitive with tbb.


> - TBB parallel_sort → M
>
> - GCC stable_sort → 2 M
> - boost stable_sort → 1.5 M
>

The combination of improved speed and lower memory usage looks impressive.
I'll try to verify this, but this looks promising.

> - TimSort → 2 M
> - GCC parallel_stable_sort → 2.5 M
> - TBB parallel_sort → 2 M
> - boost sample_sort → 2 M
> - boost parallel_stable_sort → 1.5 M
>

Both of these algorithms look good. I'm inclined towards
parallel_stable_sort because memory is becoming more of a bottleneck on
modern systems, but I'll have to do some testing.

Timsort seems fairly impressive, but specialized and already available for
free, hence not necessary to include in the library.

>
>
> If you found any problem , error or doubt, please , contact me for to
> resolve
>

Please fix your object benchmarks as described at the beginning of my
email; otherwise I don't trust them. I also would like to see string
sorting speeds. Once that is all together, I'd like to test and review
them carefully and look to see if there are any weaknesses missed by your
current tests. If they look good, and we can agree on which algorithms you
want to add to the library, we should move on to cleaning this up for
review.

Steve

Rob Stewart

Mar 25, 2015, 4:43:35 AM
to bo...@lists.boost.org
On March 24, 2015 10:32:35 PM EDT, Steven Ross <sprea...@gmail.com> wrote:
>
> > TimSort -> 2M
[snip]

> Timsort seems fairly impressive, but specialized and already available
> for
> free, hence not necessary to include in the library.

While "specialized" may be sufficient grounds to avoid including Timsort, its availability elsewhere for free is not a good justification for excluding it from Boost.

Providing a single source of good, varied algorithms, with equivalent documentation including comparative information to help users choose among the options would seem more than ample reason to include an algorithm.

I realize that the more algorithms you include, the greater your documentation and maintenance burden, so you may still choose to reject Timsort on such grounds. I just didn't want you to dismiss it too quickly.

___
Rob

(Sent from my portable computation engine)

Paul A. Bristow

Mar 25, 2015, 6:01:01 AM
to bo...@lists.boost.org
> -----Original Message-----
> From: Boost [mailto:boost-...@lists.boost.org] On Behalf Of Rob Stewart
> Sent: 25 March 2015 08:43
> To: bo...@lists.boost.org
> Subject: Re: [boost] [SORT] Parallel Algorithms
>
> On March 24, 2015 10:32:35 PM EDT, Steven Ross <sprea...@gmail.com> wrote:
> >
> > > TimSort -> 2M
> [snip]
>
> > Timsort seems fairly impressive, but specialized and already available
> > for free, hence not necessary to include in the library.
>
> While "specialized" may be sufficient grounds to avoid including Timsort, its availability
elsewhere for free
> is not a good justification for excluding it from Boost.
>
> Providing a single source of good, varied algorithms, with equivalent documentation including
> comparative information to help users choose among the options would seem more than ample reason
to
> include an algorithm.

+1

Paul

---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830

Steven Ross

Mar 25, 2015, 6:05:47 AM
to bo...@lists.boost.org
On Wed, Mar 25, 2015 at 4:43 AM, Rob Stewart <rob.s...@verizon.net> wrote:
> On March 24, 2015 10:32:35 PM EDT, Steven Ross <sprea...@gmail.com> wrote:
>>
>> > TimSort -> 2M
> [snip]
>
>> Timsort seems fairly impressive, but specialized and already available
>> for
>> free, hence not necessary to include in the library.
>
> While "specialized" may be sufficient grounds to avoid including Timsort, its availability elsewhere for free is not a good justification for excluding it from Boost.
>
> Providing a single source of good, varied algorithms, with equivalent documentation including comparative information to help users choose among the options would seem more than ample reason to include an algorithm.
>
> I realize that the more algorithms you include, the greater your documentation and maintenance burden, so you may still choose to reject Timsort on such grounds. I just didn't want you to dismiss it too quickly.

Sure, if a Timsort library author wishes to volunteer to include it in
boost, I'll add it.

Francisco José Tapia

Mar 25, 2015, 3:57:30 PM
to Boost List
Hi Steven,

Thanks for your fast response. I want to clarify several points from your
message. Please excuse my English; it's not my mother tongue, and I make
errors and imprecisions.


TEST OBJECTS

The object used in the test (Reburcio is a word from an exercise in my
first Pascal book, and doesn't mean anything) copies all the elements with
a loop in its constructor, copy constructor, and operator=. The only
operation which uses only the first element is the comparison.

I made a new comparison which compares the sum of all the elements of the
object. The times are similar; you must add 5% for the small objects up to
15% for the big objects.


TIMSORT

The TimSort code is not mine. TimSort is used in Java and Android, and is
especially efficient with big objects (you can see this in
benchmark_single_stable_sort_objects.txt). I used it for speed comparison.
I didn't explore TimSort much, because I saw that the merge-based methods
were faster, and I discarded the algorithm.

If you think it deserves a second opportunity, we can look for new
implementations and check the speed in order to take a decision. I didn't
measure the auxiliary memory used by TimSort. Wikipedia mentions O(N)
auxiliary memory, but perhaps saying it's equal to the memory used by the
data is not correct.

In the next benchmark programs, as Steven says, the data will be loaded
from a file, and each program will execute only one algorithm. With this
configuration, checking the memory used is simple. Now it's very difficult,
because the program executes all the algorithms, and it's not possible to
know how much memory each algorithm uses.


DESIGN PHASE

Now we are in the design phase. Nothing is fixed; we can comment on and
change anything.

The specification is simple: provide sort algorithms, stable and unstable,
that run with 1 thread or with any number of threads, without any external
library (such as OpenMP, TBB, CilkPlus...), only the C++ standard.

If you want to discuss this, we can. We welcome ideas, suggestions, and
opinions; any help is welcome. The final decision about whether the code is
included in the Boost sort library belongs to Steven, the creator of the
library.

After the design phase, with the decisions taken and the interfaces fixed,
we will prepare the final version, redesign the test system, test in depth,
and create the documentation.


ABOUT MYSELF

This software comes from another project for the Boost library
(Countertree, concurrent red-black trees with access by position, like
vectors), which has been pending final revision for nearly two years.

When I began, the implementation of the parallel non-stable sort had
performance similar to the GCC version (based on OpenMP) and the TBB
version.

The stable sort with 1 thread was slower than the GCC version, and the
parallel stable sort was faster than the GCC version based on OpenMP.

When someone mentioned the TBB sample sort implementation (even though it
is experimental, doesn't destroy the objects in the auxiliary memory, and
is not exception safe), we had a crisis: that version was 20-30% faster
than ours. I want to show my admiration for TBB and their team; I think
they are the best at writing concurrent software.

I tried to improve. I would like to have more time, more equipment and
better documentation in order to find the best options, but I don't have
them, and I do this in my free time after my work and my family.

I designed 6 stable sort algorithms in order to select the fastest
(internally named smart_merge_sort, about 10% faster than the GCC version).
I prepared a version of sample sort using only threads and atomic
variables. In the benchmarks it has performance similar to the TBB version,
uses half the auxiliary memory used by TBB, and is exception safe.

In my opinion, the parallel stable sort is good enough to be included in
the final version. The important case, which shows the quality of the
algorithm, is when the data are unsorted. I will try to implement the
detection of already-sorted elements, as in TBB, but I am not sure about
the results. The internal loops of the algorithm are delicate, and a small
change can produce big changes in performance, as happened in other
algorithms.

We are in the design phase. We must test on other machines, big and small.
Perhaps we must change or redesign things, but we need help. If someone
tests the programs, please send me the results file and the machine
description (processor, number of cores, speed, memory, cache) so that we
can draw conclusions, fix the code and move to the next phase.

Thanks in advance


Francisco

Jeremy Murphy

unread,
Mar 26, 2015, 2:15:20 AM3/26/15
to boost
Hi Francisco,

On 17 December 2014 at 05:42, Francisco José Tapia <fjt...@gmail.com>
wrote:

>


> SCALABILITY: Some algorithms don't scale well when the number of threads
> grows. Increasing the number of threads doesn't decrease the execution
> time. This lack of scalability is emphasized when the size of the
> elements to sort grows.


Sorry about butting in so late, but on the topic of scalability and
potential superiority to TBB, you might be interested to check out Hartmut
Kaiser's talk on asynchronous computing (
https://www.youtube.com/watch?v=5xyztU__yys) and his lab's library, HPX (
http://stellar.cct.lsu.edu/projects/hpx/).
Cheers.

Jeremy

Francisco José Tapia

unread,
Mar 26, 2015, 9:07:52 AM3/26/15
to Boost List
Thanks,
I will watch it. It is always interesting to hear Mr Kaiser and to learn
new things from his huge knowledge.

Francisco

Steven Ross

unread,
Mar 26, 2015, 10:32:26 PM3/26/15
to bo...@lists.boost.org
Hi Francisco,

On Wed, Mar 25, 2015 at 3:57 PM Francisco José Tapia <fjt...@gmail.com>
wrote:

> TEST OBJECTS


>
> The object used in the test (Reburcio is a word from an exercise in my
> first Pascal book, and doesn't mean anything). The construction, copy
> construction and operator= copy all the elements with a loop. The only
> operation which uses only the first element is the comparison.
>
> I have done a new comparison, comparing the sum of all the elements of
> the object. The times are similar; you must add 5% for the small objects
> up to 15% for the big objects.
>

I'm not precisely sure of your meaning, but I suggested modification of the
test object to use default operators where possible, to randomize the
entire contents of the object (so that not all integers in the array are
the same), and to use the entire contents of the object for comparison. I
believe that these suggested changes make it more reflective of the
conventional large-object comparison case. Most of the time, only the
first integer will be compared, but in equal cases the entire array will be
compared. I don't like feeding overly simplified objects into benchmarks,
because compilers can make some surprising optimizations.


> TIMSORT
>
> The Tim Sort code is not mine.
>

Agreed. I think we can drop it from this comparison, as your algorithm is
clearly much faster. I mentioned it because it is used with other
languages, and responded when somebody else on the email group suggested I
allow it. I'm not asking you for an implementation.
Timsort uses N/2 additional memory.


> DESIGN PHASE
>
> Now , we are in the design phase. Nothing is fixed. We can comment and
> change anything.
>
> The specification is simple, provide sort algorithms, stable and not
> stable, for to run with 1 thread or with any number of threads, without any
> other external library ( as Open MP , TBB, CilkPlus..). Only with the C++
> standard.
>
> If you want to discuss about this, we can do. We admit ideas, suggestions,
> opinions. Any help is welcome. The final decision about if the code is
> included or not in the boost sort library, depend of Steven, the creator of
> the library.
>
> After the design phase, with the decisions taken and the interfaces fixed.
> We prepare the final version, redesign a test system, test in deep and
> create the documentation.
>

I suggest deciding what you want to try to include. Getting a parallel
unstable sort competitive with TBB will be hard, and I'm skeptical of
including a library that just replicates what TBB provides for easy use.


> I tried to improve. I would like to have more time, more equipment and
> better documentation in order to find the best options, but I don't have
> them, and I do this in my free time after my work and my family.
>

I'm in the same situation with regards to available time. I think many
boost participants are.


>
> I designed 6 stable sort algorithms in order to select the fastest
> (internally named smart_merge_sort, about 10% faster than the GCC
> version). I prepared a version of sample sort using only threads and
> atomic variables. In the benchmarks it has performance similar to the TBB
> version, uses half the auxiliary memory used by TBB, and is exception safe.
>

I like your parallel stable sorts. They look like the most promising part
of your proposed library additions. Indirect sorting is also interesting.
I just want reliable large-object and string comparison numbers for them
before diving in further.

Have you looked at integrating your testing with the approach in the Boost
Sort library? The randomgen binary generates files with different types of
random number distributions that can test a variety of scenarios, and the
tune.pl script controls automated performance testing.


>
> In my opinion, the parallel stable sort is good enough to be included in
> the final version. The important case, which shows the quality of the
> algorithm, is when the data are unsorted. I will try to implement the
> detection of already-sorted elements, as in TBB, but I am not sure about
> the results. The internal loops of the algorithm are delicate, and a
> small change can produce big changes in performance, as happened in other
> algorithms.
>

Is there a way you can do a fast initial pass, checking for already-sorted
data? That's what Spreadsort does (at each level of the sort).

Francisco José Tapia

unread,
Mar 29, 2015, 6:25:22 AM3/29/15
to Boost List
Hi Steven

Thanks for your message.

Until now, I had been focused mainly on the algorithms, and I didn't
dedicate much time to other things such as the test format and the
integration with boost::sort. Now, the first things to do are:

- Try to detect when the data are already sorted.
- Revise your library and code in depth, in order to adapt the tests and
benchmarks to the procedures used in your library.

About the object used in the benchmarks, we can do something simple.
Reburcio is a very small class in the file
Benchmarks/GCC/algorithm/reburcio.hpp. Please rewrite it and send it to me;
we only need to recompile the benchmarks.


*About the parts to include*

With the parallel unstable sort, all the algorithms I examined have the
same approach and, because of this, similar performance. The room for
decision is small.

The goal is to provide algorithms independent of any library or operating
system: fast, robust and easy to use. The idea of doing the same thing as
a company is the origin of much free software; think of Oracle and MySQL,
or Internet Explorer and Firefox. Even the C++ standard does what many
companies had been doing for many years, each in its own way. The final
users are grateful because it simplifies their work.

TBB is available on Windows, Linux and OS X. On other operating systems
(for example Android and all the real-time OSes), you must recompile all
the source code, and I am not sure about the dynamic linking of TBB on
these operating systems. Many small machines don't have a task scheduler,
but they have threads. Forcing a recompilation of TBB just to use a
parallel sort is like renting a bus to transport a single person.

Our code has similar performance, is small, and is independent of any
operating system and of any other code or library. If you are C++11
compliant you have a parallel sort. I think we must include sort, parallel
sort, stable sort and parallel stable sort. Perhaps also sample sort, but
it's less important.

Now I am beginning to examine your code in order to integrate the new code
with the Boost Sort approach.

Yours


Francisco

Steven Ross

unread,
Apr 5, 2015, 11:13:26 PM4/5/15
to bo...@lists.boost.org
On Sun, Mar 29, 2015 at 6:25 AM Francisco José Tapia <fjt...@gmail.com>
wrote:

> Hi Steven


>
> Thanks for your message.
>
> Until now, I had been focused mainly on the algorithms, and I didn't
> dedicate much time to other things such as the test format and the
> integration with boost::sort. Now, the first things to do are:
>
> - Try to detect when the data are already sorted.
>

Just verify that each successive element is not less than the previous. If
inputs are unique (or completely identical if they compare the same), then
you can compare results with those of std::sort.


> - Revise your library and code in depth, in order to adapt the tests and
> benchmarks to the procedures used in your library.
>
> About the object used in the benchmarks, we can do something simple.
> Reburcio is a very small class in the file
> Benchmarks/GCC/algorithm/reburcio.hpp. Please rewrite it and send it to
> me; we only need to recompile the benchmarks.
>

I suggest renaming the variables to be clearer, but here's the idea:

template <uint32_t NN>
struct intarray
{
    uint32_t M[NN];

    bool operator < (const intarray &R) const
    {
        for (uint32_t i = 0; i < NN; ++i) {
            if (M[i] != R.M[i]) {
                return M[i] < R.M[i];
            }
        }
        return false;
    }
};

Use <BOOST_ROOT>/libs/sort/example/randomgen.cpp (b2 libs/sort/randomgen
--tune will build the binary in the libs/sort dir) to generate randomized
input files, and memcpy them into arrays of intarray to fill them with the
data to sort. This will emulate the large non-vector objects the indirect
sorting approach you've designed is optimized for, and will fill the entire
object with randomized contents, along with having a full (default) copy
constructor.

You can use strings in the place of intarray also for testing strings; I
randomize them by just using the >> operator on strings to read them in; it
will automatically break on some characters, causing the strings to have
randomized length.

>
>
> *About the parts to include*
>
> With the parallel unstable sort, all the algorithms I examined have the
> same approach and, because of this, similar performance. The room for
> decision is small.
>
> The goal is to provide algorithms independent of any library or operating
> system: fast, robust and easy to use. The idea of doing the same thing as
> a company is the origin of much free software; think of Oracle and MySQL,
> or Internet Explorer and Firefox. Even the C++ standard does what many
> companies had been doing for many years, each in its own way. The final
> users are grateful because it simplifies their work.
>
> TBB is available on Windows, Linux and OS X. On other operating systems
> (for example Android and all the real-time OSes), you must recompile all
> the source code, and I am not sure about the dynamic linking of TBB on
> these operating systems. Many small machines don't have a task scheduler,
> but they have threads. Forcing a recompilation of TBB just to use a
> parallel sort is like renting a bus to transport a single person.
>

Who wants to do a parallel sort on Android? The OS often only gives you
one core (depending on priorities), and it would burn the battery at
ridiculous speed. I just don't see the use case, unless you can provide
some advantage in some fairly common use case, and it would have to fully
match tbb's features to within a few percent, including efficiency on
already-sorted input.


> Our code has similar performance, is small, and is independent of any
> operating system and of any other code or library. If you are C++11
> compliant you have a parallel sort. I think we must include sort, parallel
> sort, stable sort and parallel stable sort. Perhaps also sample sort, but
> it's less important.
>

Your stable sort and parallel stable sort look promising, and sample sort
is a possibility. I recommend concentrating on them for now.

Ben Pope

unread,
Apr 6, 2015, 1:37:20 AM4/6/15
to bo...@lists.boost.org
On Monday, April 06, 2015 11:13 AM, Steven Ross wrote:
> Who wants to do a parallel sort on Android? The OS often only gives you
> one core (depending on priorities), and it would burn the battery at
> ridiculous speed.

I was under the impression that it's better on battery life to use all
the cores at maximum and then sleep as quickly as possible. Clock gating
of the components is getting better, but for example (and I'm completely
making this up now), if two CPUs share a cache, then the cache is alive
anyway so you may as well use the other CPU. I've heard the term "race
to sleep" to describe this.

I think it's hard to guess at; mobile CPUs now have 8 cores, and they're
not even homogeneous, some are in order execution and some out of order.

Ben

Aditya Atluri

unread,
Apr 6, 2015, 1:55:33 AM4/6/15
to bo...@lists.boost.org
It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi-threaded CPU architecture brings gaming to a whole new level. I see some physics done with sort, map, gather, scatter, and so on.

PS: I proposed this as part of Boost.Compute for this GSoC, giving GPU support for iOS devices using the Metal shading language.

Regards,
Aditya Atluri.

Steven Ross

unread,
Apr 6, 2015, 6:54:04 AM4/6/15
to bo...@lists.boost.org
On Mon, Apr 6, 2015 at 1:55 AM, Aditya Atluri <ab...@gwmail.gwu.edu> wrote:
> It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi threaded CPU architecture brings gamin to a whole new level. I see some physics done with sort, map, gather, scatter, and all.
>
> PS: I proposed this as a part of boost compute this GSoC giving GPU support for iOS devices using metal shading language.

If you can find somebody whose frame rate on mobile is significantly
impacted by CPU sorting speed, I'd love to chat with them.

>> On Apr 6, 2015, at 1:37 AM, Ben Pope <benp...@gmail.com> wrote:
>>
>>> On Monday, April 06, 2015 11:13 AM, Steven Ross wrote:
>>> Who wants to do a parallel sort on Android? The OS often only gives you
>>> one core (depending on priorities), and it would burn the battery at
>>> ridiculous speed.
>>
>> I was under the impression that it's better on battery life to use all the cores at maximum and then sleep as quickly as possible. Clock gating of the components is getting better, but for example (and I'm completely making this up now), if two CPUs share a cache, then the cache is alive anyway so you may as well use the other CPU. I've heard the term "race to sleep" to describe this.
>>
>> I think it's hard to guess at; mobile CPUs now have 8 cores, and they're not even homogeneous, some are in order execution and some out of order.

The term race-to-sleep applies to faster but higher-power processors
sometimes being more power efficient because the system can go to sleep
faster. That said, since modern mobile systems normally operate with all
but one core asleep, they are optimized for the single-core use case, and
very little power is wasted on the unused CPUs. When you add in that
parallel sorting has 75% of the CPU efficiency, it just doesn't make sense
from a power perspective, especially if some other task can be computed in
parallel with the sort, and since there is a single-threaded option
(Spreadsort) that can sort 1.5-2X faster on just one core.

Olaf van der Spek

unread,
Apr 6, 2015, 8:55:01 AM4/6/15
to bo...@lists.boost.org
On Mon, Apr 6, 2015 at 12:53 PM, Steven Ross <sprea...@gmail.com> wrote:
> The term race-to-sleep applies to faster but higher-power processors
> sometimes being more power efficient because the system can go to
> sleep faster. That said, as modern mobile systems normally operate
> with all but one core asleep,

Are you sure about that?

> they are optimized for the single-core
> use case, and very little power is wasted on the unused CPUs.

--
Olaf

Aditya Atluri

unread,
Apr 6, 2015, 10:06:18 AM4/6/15
to bo...@lists.boost.org


Regards,
Aditya Atluri.

> On Apr 6, 2015, at 6:53 AM, Steven Ross <sprea...@gmail.com> wrote:
>
>> On Mon, Apr 6, 2015 at 1:55 AM, Aditya Atluri <ab...@gwmail.gwu.edu> wrote:
>> It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi threaded CPU architecture brings gamin to a whole new level. I see some physics done with sort, map, gather, scatter, and all.
>>
>> PS: I proposed this as a part of boost compute this GSoC giving GPU support for iOS devices using metal shading language.
>
> If you can find somebody whose frame rate on mobile is significantly
> impacted by CPU sorting speed, I'd love to chat with them.

Unreal Engine has gone open source; check it out. OpenSubDiv is Pixar's surface subdivision software, widely used in movie production and games (COD Ghosts). There are backends for Android and iOS. Last time I checked it had an ARM NEON based backend (I need to recheck).

It is common practice in the gaming industry to use the CPU for physics (this predates CUDA and OpenCL, but it is still done).

I don't know anyone, but you can contact the developers in GitHub repositories.

Aditya Atluri

unread,
Apr 6, 2015, 10:12:53 AM4/6/15
to bo...@lists.boost.org


Regards,
Aditya Atluri.

> On Apr 6, 2015, at 6:53 AM, Steven Ross <sprea...@gmail.com> wrote:
>
>> On Mon, Apr 6, 2015 at 1:55 AM, Aditya Atluri <ab...@gwmail.gwu.edu> wrote:
>> It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi threaded CPU architecture brings gamin to a whole new level. I see some physics done with sort, map, gather, scatter, and all.
>>
>> PS: I proposed this as a part of boost compute this GSoC giving GPU support for iOS devices using metal shading language.
>
> If you can find somebody whose frame rate on mobile is significantly
> impacted by CPU sorting speed, I'd love to chat with them

FYI,
Most rendering applications have asynchronous ability: after issuing a draw call, the GPU takes time to render the scene; meanwhile, the CPU does all the physics calculations. Doing physics on the GPU increases frame rendering time. This is one of the reasons why mobile devices still use CPU-based physics and AI systems. The decision to use the CPU or the GPU is made by the developer by testing that the frame rate is reasonable: if the frame rates are high, he may choose GPU-based physics or AI; otherwise he sticks to the CPU.

Francisco José Tapia

unread,
Apr 6, 2015, 12:52:58 PM4/6/15
to Boost List
Hi,

About changing the name and using intarray: it's OK, I am changing it now.
I was looking at the code of randomgen and I didn't understand how it
works. With the explanation about the operator >>, I understand, and I
will use it in the test implementation.

About Android, I recommend you look at the Shield consoles from Nvidia.
The software inside these consoles is comparable with any Linux
workstation.

But perhaps we don't have the correct perspective on this. We don't
presume to tell programmers what they must use and what they must not. We
aim to provide tools that simplify their work; using them or not is their
decision.

In the current implementations of parallel sort algorithms in TBB or with
OpenMP, the way to select only a part of the HW threads is not clear.
Recently EZchip announced a processor with 100 ARM cores. Our
implementation has a parameter to select the number of HW threads to use;
by default it uses all of them. But perhaps you want to use only 40 HW
threads, or a percentage of the HW threads of the machine where the
program is running.

In the future we must involve the GPU but, in my opinion, we must first
obtain fast and robust parallel algorithms on the CPU.


Francisco

Steven Ross

unread,
Apr 7, 2015, 6:42:46 AM4/7/15
to bo...@lists.boost.org
On Mon, Apr 6, 2015 at 10:12 AM Aditya Atluri <ab...@gwmail.gwu.edu> wrote:

>
>
> Regards,
> Aditya Atluri.
>
> > On Apr 6, 2015, at 6:53 AM, Steven Ross <sprea...@gmail.com> wrote:
> >
> >> On Mon, Apr 6, 2015 at 1:55 AM, Aditya Atluri <ab...@gwmail.gwu.edu>
> wrote:
> >> It's not just CPUs. Most of the physics engines used in games for
> phones are CPU based. The current multi threaded CPU architecture brings
> gamin to a whole new level. I see some physics done with sort, map, gather,
> scatter, and all.
> >>
> >> PS: I proposed this as a part of boost compute this GSoC giving GPU
> support for iOS devices using metal shading language.
> >
> > If you can find somebody whose frame rate on mobile is significantly
> > impacted by CPU sorting speed, I'd love to chat with them
>
> FYI,
> Most of the rendering applications, have asynchronous ability. Like, after
> issuing a draw call, the gpu takes time to render the scene. Meanwhile, the
> CPU does all the physics calculation. Doing physics on the GPU causes
> increase in frame rendering time. This is one of the reason why mobile
> devices are still using CPU based Physics, AI systems. The design call of
> using CPU or GPU depends on the developer by testing the frame rate to be
> reasonable. If the frame rates are high, he may chose to go for GPU based
> Physics or AI. Else he sticks to CPU.
>
I understand that they're doing physics calculations, but do those
physics calculations contain a substantial amount of purely sequential
sort time, where they can't run other computations in parallel? If so, I'd
be interested in talking with them to understand what they're doing.