On Sun, Dec 14, 2014 at 1:03 PM, Francisco José Tapia <fjt...@gmail.com> wrote:
> Sorry for the delay. As promised, here are the parallel sorting algorithms.
>
> On my computer (quad core) the GCC std::sort and stable_sort are slightly
> faster than my versions of intro_sort and merge_sort (stable_sort is a
> synonym of merge_sort). On another computer with an Intel i3, my algorithms
> are slightly faster than the GCC versions.
An i3 is low-end hardware. I recommend testing on an i7 or faster system.
> This code is a preliminary version, done for my countertree library. If
> you consider it useful, I will change the namespace and document the code,
> the algorithms, and the ideas obtained from the benchmarks.
>
> Please feel free to ask me any questions, or to modify or change
> anything. The goal is to provide good sort methods to the users.
>
> In the zip file you have the code and a file with a description of each
> piece of code.
First, thanks for sending this in a fairly easy-to-test format.
I suggest updating your docs to recommend single-line compilation (if
appropriate), which saves a step:
Sort/Modified/benchmark/GCC/util_sort$ g++ -std=c++11 -Wall
-fexceptions -fopenmp -fomit-frame-pointer -fexpensive-optimizations
-O3 -I../../../src benchmark_sort_03.cpp -fopenmp -s -lpthread -ltbb
-o benchmark_sort_03 -pthread
I was unable to run your indirect sort benchmark:
Sort/Modified/benchmark/GCC/util_sort$ g++ -std=c++11 -Wall
-fexceptions -fopenmp -fomit-frame-pointer -fexpensive-optimizations
-O3 -I../../../src -c benchmark_sort_indirect.cpp -o
benchmark_sort_indirect.o -pthread
Sort/Modified/benchmark/GCC/util_sort$ g++ -o benchmark_sort_indirect
benchmark_sort_indirect.o -pthread -fopenmp -s -lpthread -ltbb
Sort/Modified/benchmark/GCC/util_sort$ ./benchmark_sort_indirect
Size of element 8 Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)
I'm actually quite interested in the generalized indirect sort idea (I
looked at your two functions). Do you think you could package that up
so it's a simple wrapper, where the user passes in the sort function
they want to use, and it takes care of changing the input and
comparison (if possible) to be indirect, running the sort, and then
swapping into place based on the results of the indirect sort? If
that can be done without significant loss of efficiency, it would seem
quite useful (I've had to do indirect sorts multiple times, and it's
always taken some code). Even your two functions look useful on their own.
I suggest using the Boost random number generator (or, if you stick
with the C++11 requirement, which I don't recommend, the std::
equivalent):
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int_distribution.hpp>
boost::random::mt19937 generator;
boost::random::uniform_int_distribution<int> distribution;
And for each random number:
int result = distribution(generator);
You may want to reuse and modify some of my tests in
https://github.com/spreadsort/sort/tree/master/example
rand() is returning lots of duplicate elements, as you're sorting more
than RAND_MAX elements.
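For example, a 64-bit standard-library engine avoids the duplicate problem; here is a sketch of a replacement generator (the definition is my own illustration, with a fixed seed so benchmark runs are reproducible):

```cpp
#include <cstdint>
#include <random>

// rand() yields at most RAND_MAX (often 2^31 - 1) distinct values, so any
// benchmark over more elements than that must contain duplicates. A 64-bit
// engine makes unintended duplicates vanishingly unlikely.
std::uint64_t my_rand()
{
    static std::mt19937_64 gen(12345);  // fixed seed: reproducible runs
    return gen();
}
```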
I'm not sure why you're putting so much effort into competing with
stable_sort. Some people use it, but I've never seen it in actual
production code.
std::stable_sort is not just merge sort, as based on the documentation
it can use less memory than merge sort:
http://www.cplusplus.com/reference/algorithm/stable_sort/
Complexity
If enough extra memory is available, linearithmic in the distance
between first and last: Performs up to N*log2(N) element comparisons
(where N is this distance), and up to that many element moves.
Otherwise, polyloglinear in that distance: Performs up to N*log2(N)^2
element comparisons, and up to that many element swaps.
The speed results I get on my system for your library are below. It
just doesn't seem competitive with tbb for speed on my system. If you
get it there (within 5% on random, sorted, and mostly-sorted data on
both Windows and Linux), or if you find someone else with a genuine
need to use this in production code, I'll be interested to take
another look. This is compiled using g++ (Ubuntu 4.8.2-19ubuntu1)
4.8.2 and run on an Intel(R) Core(TM) i7-3612QM CPU @ 2.10GHz:
max. OpenMP threads:8 (processors)8
Sort of 200000000 elements in a vector -------------------
Sorted elements -----------------------------------------------
STL std::sort :3.60897 secs
parallel::sort :1.24468 secs
tbb::parallel_sort :0.059404 secs
countertree::parallel_sort :1.25563 secs
std::stable_sort :10.3108 secs
parallel::stable_sort :3.1866 secs
countertree::parallel_merge_sort :3.71677 secs
Reverse sorted elements ---------------------------------------
STL std::sort :2.57237 secs
parallel::sort :1.02421 secs
tbb::parallel_sort :0.895415 secs
countertree::parallel_sort :1.30754 secs
std::stable_sort :10.4286 secs
parallel::stable_sort :3.21695 secs
countertree::parallel_merge_sort :4.08913 secs
Random elements, few repeated ( rand() )-----------------------
STL std::sort :19.7451 secs
parallel::sort :3.98043 secs
tbb::parallel_sort :4.28722 secs
countertree::parallel_sort :6.85041 secs
std::stable_sort :18.6291 secs
parallel::stable_sort :4.08416 secs
countertree::parallel_merge_sort :6.09388 secs
Random elements, quite repeated ( rand() % (NELEM/2) )---------
STL std::sort :19.5069 secs
parallel::sort :3.93207 secs
tbb::parallel_sort :4.66967 secs
countertree::parallel_sort :6.29049 secs
std::stable_sort :18.8519 secs
parallel::stable_sort :4.31173 secs
countertree::parallel_merge_sort :6.1165 secs
Random element many repeated (rand() % 10000)--------------------
STL std::sort :11.8256 secs
parallel::sort :2.65534 secs
tbb::parallel_sort :3.44815 secs
countertree::parallel_sort :3.79712 secs
std::stable_sort :18.7368 secs
parallel::stable_sort :3.70629 secs
countertree::parallel_merge_sort :6.09851 secs
Equal elements --------------------------------------------------
STL std::sort :4.50343 secs
parallel::sort :1.44262 secs
tbb::parallel_sort :0.0588514 secs
countertree::parallel_sort :1.53466 secs
std::stable_sort :10.382 secs
parallel::stable_sort :3.22236 secs
countertree::parallel_merge_sort :4.17672 secs
On Tue Dec 16 2014 at 1:42:59 PM Francisco José Tapia <fjt...@gmail.com>
wrote:
> Hi Steven,
>
> I thought all the program was checked, but obviously not. Now in the
> folder you have a shell script with the command to compile and link
> with optimization (for example make_benchmark_sort_03.sh), and a
> make_all.sh which compiles everything. Open a terminal in the folder,
> and compile and run.
>
Great! Thanks. I ran everything, and the only failure was with
benchmark_sort_indirect:
./benchmark_sort_indirect
Size of element 8 Number of elements :100000000
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
Aborted (core dumped)
>
> As you suggested, I changed the number generator and now use a Mersenne
> Twister from the standard library.
>
> Thanks
> The stable sort is used less than the non-stable sort, but sometimes it
> is important. I developed it because my project has a proxy which receives
> operations on elements identified by a key. The proxy sorts and sends the
> operations to the corresponding thread, and it is not the same to read the
> value of a key and afterwards delete that key as to delete the key and
> afterwards try to read the value. I can't add a time mark to know which
> arrived first.
>
> Fair enough.
> As I said in the previous message, the stable sort with 1 thread needs
> 2.7 secs, and with 4 HW threads needs 8.49 secs, around 3 times the time
> of 1 thread. The countertree::merge_sort needs 2.7 secs to sort with 1
> thread, and with 4 HW threads needs 4.89 secs (see the original
> test_stress_sort.cpp).
>
> This shows countertree::merge_sort is good for use in parallel SW
> and GCC std::stable_sort is bad.
>
The numbers I see are not as extreme, but your merge_sort looks good in
comparison, especially with large elements. I will investigate
alternatives to see if this is a truly speed-competitive offering.
Have you considered Timsort (standard in Java and Python)? Timsort is a
stable sort that is particularly good with sorted, mostly-sorted, and
reverse-sorted data. It might be worth offering as an alternative stable
sort for those concerned about mostly-sorted data.
Your intro_sort looks reasonable overall, but it varies from comparable
speed to 14% slower when run in parallel on random data, and is an order of
magnitude slower on already-sorted data. As tbb is free, I just don't see
a compelling case for your introsort implementation; tbb is a very hard
competitor to beat.
> The GCC std::sort and countertree::intro_sort have similar performance
> with all element sizes. But in the parallel process, when the size of
> the elements grows, poor scalability appears. The next lines are from
> the original benchmark_sort_05.cpp on my computer:
>
> With 128-byte elements
>
> Random elements, few repeated -----------------------
> STL std::sort :2.9188 secs
> countertree::intro_sort :2.87864 secs
> parallel::sort :2.46113 secs
> tbb::parallel_sort :1.86458 secs
> countertree::parallel_sort:1.89263 secs
>
> With 256-byte elements
>
> Random elements, few repeated ( rand() )-----------------------
> STL std::sort :3.03143 secs
> countertree::intro_sort :3.02733 secs
> parallel::sort :2.23231 secs
> tbb::parallel_sort :1.76186 secs
> countertree::parallel_sort:1.56509 secs
>
> As we can see, the parallel process is ineffective with GCC parallel::sort
> on big objects.
>
>
Those don't match up with my results for 256-byte elements:
STL std::sort :1.04718 secs
countertree::intro_sort :1.14214 secs
parallel::sort :0.679178 secs
tbb::parallel_sort :0.600001 secs
countertree::parallel_sort:0.627144 secs
And this is what I see for size 64:
STL std::sort :1.6577 secs
countertree::intro_sort :1.73023 secs
parallel::sort :0.963199 secs
tbb::parallel_sort :0.873805 secs
countertree::parallel_sort:0.997228 secs
I'm not sure the benefits you are seeing are portable.
>
> About the merge_sort algorithm: it additionally uses half of the memory
> used by the data, but if you don't use additional memory, the speed drops.
> I didn't examine in depth the algorithm used in GCC; I only wrote my own
> algorithm. I will examine it, and I will try to reduce the memory used,
> without loss of speed and scalability.
>
The standard algorithm says it will lose speed when it reduces memory used;
it'd be a nice feature to have, but if we can repeatably see 2X relative
speedups without that feature, I think some people may find it useful.
>
> About a server to test on: I don't have modern servers in my school; the
> budget cuts eliminated the new servers. I have an i7 Xeon, but it is a
> 3-year-old machine with 4 HW threads, and in the benchmarks it is slower
> than the i3 and i5 machines bought this year. If someone can check and
> provide us the results, I will be grateful. Version 4.9.1 of GCC has
> better optimization than 4.8.2. If you are using Ubuntu, version 14.10
> has 4.9.1 as the default.
>
A boost library should still be useful for people at least one compiler
version back; I feel fully justified testing on 4.8.2. Have you tested on
Windows yet?
With all the information, we can decide on the best solution, if it exists.
Thanks
Francisco Tapia
On Mon Dec 22 2014 at 5:04:17 PM Francisco José Tapia <fjt...@gmail.com>
wrote:
> > I have been working on the memory used by the different algorithms,
> > especially the stable algorithms. I developed a low-memory algorithm,
> > circular_stable_sort, which uses only 10% more memory; the speed
> > decreases around 15% with small objects, and with big objects it is
> > similar to the merge_sort.
>
Some people may find that useful. What is the worst-case computational
complexity? Is it a well-known algorithm? The bar is going to be higher
(it may require a full review) for previously unknown algorithms.
> > I was looking, too, at the TimSort algorithm. I found a C++
> > implementation at https://github.com/gfx/cpp-TimSort .
> >
> > I inserted this algorithm in the benchmark programs. With small objects
> > the speed is slower than the stable sort and merge_sort, and it has
> > good performance with big objects, greater than 64 bits. The decision
> > of Java and Python is logical, because they must order objects, usually
> > non-contiguous and bigger than the naked number.
> >
> > To measure the memory used by the program, I use the command
> > /usr/bin/time -v program_to_execute. I take benchmark_algorithm_04
> > (100000000 uint64_t numbers), comment out all the invocations of the
> > algorithms except one, compile, and run it in a terminal with the
> > command /usr/bin/time -v benchmark_sort_04.
> >
> > I repeat this process with all the algorithms to know the memory
> > used by each algorithm.
>
I suggest changing the struct you use for benchmarking to this:
template <uint32_t NN>
struct reburcio
{
    uint64_t M[NN];

    reburcio() { init(); }

    void init() {
        for (uint32_t i = 0; i < NN; ++i) {
            M[i] = my_rand();
        }
    }

    bool operator < (const reburcio &R) const {
        for (uint32_t i = 0; i < NN; ++i) {
            if (M[i] != R.M[i]) {
                return M[i] < R.M[i];
            }
        }
        return false;
    }
};
That way it's using the default copy and assignment operators, which copy
over the entire data structure, and it initializes and compares all elements
of the array. It appears that the benchmark you sent me only copies over
and compares the first element in the array, which might have caused some
of the large-element differences in performance you were seeing.
With this data structure, countertree intro_sort is competitive on my Linux
system with std::sort on fully random data, and countertree parallel_sort
is competitive for 8, 16, and 256 bytes per element, but significantly
underperforms (10-15%) tbb for 32 and 64 bytes.
> > As you can see, with a 256-byte size the indirect
> > parallel_stable_sort is 3 times faster, and the indirect_sort is 25%
> > faster.
>
Yes, an indirect sort should eventually be faster, given a large enough
element, especially if the indirect sort is done with the fastest
underlying algorithm available.
> >
> > The final solution, if in the end one is adopted, will be a hybrid
> > algorithm depending on the size of the elements.
>
That sounds reasonable for fixed-size elements. What about variable-length
elements, like strings? Will this only be for fixed-size elements?
> > I propose the following: give me 2 or 3 weeks and I will prepare a big
> > benchmark with all the algorithms, including the indirect versions of
> > the stable and unstable sorts. We will run the benchmarks on different
> > machines, and according to the results, decide whether we propose
> > something to Boost, and which algorithm must be used in the hybrid
> > version.
> >
> > If, in the end, we decide not to propose anything, I can only say this
> > has been a nice and exciting experience.
>
I think indirect sorting is promising.
> >
> > My intention is not to impose my code; I want to provide the final
> > users the best solution, if it exists, with my code or with the code of
> > others. I am the first to applaud the winner.
> >
> > We are doing this because this parallel implementation doesn't need
> > any other library, only the standard one. There are compilers without
> > OpenMP or TBB support, and some users simply don't want to load them
> > just to do a sort.
>
Agreed, but OpenMP is very common (even Android supports it!), so systems
without OpenMP are a very small niche in terms of total systems out there.
TBB is supported on intel-compatible processors, and the vast majority of
deployed processors where parallel sorting makes sense (won't drain the
battery really quickly) are intel-compatible. If we're going to provide an
alternative, it should cost very little in terms of performance, and
ideally provide some kind of benefit besides eliminating a dependency.
> > I have in mind several improvements to the algorithms. I will code and
> > test them, and if they run well, I will include them in the algorithms.
> > I want, too, to simplify the interface of indirect_sort, so it can be
> > used in an easy way by any sort algorithm. If you are interested, I
> > will also prepare a brief document explaining the internal algorithm.
>
I'd like to see that document.
Along with testing a struct that is an array of ints, you should probably
test variable-length string sorting, as string sorting is very common.
Merry Christmas,
Steve
On Tue Dec 23 2014 at 3:31:32 AM Francisco José Tapia <fjt...@gmail.com>
wrote:
> We can also do a benchmark with variable-length elements, with strings.
>
> If you prepare the operations to do in that test, I can prepare an
> additional benchmark with all the algorithms and your operations.
>
> After all, we will look at the results and take a decision.
>
> Yours
>
> Francisco
>
The approach I've used with the boost::sort library is to set up the
benchmark tests so that they read in a file of random data generated by a
single application (randomgen), or any other file that is passed in for
testing, and they write out their results to file. This enables better
testing and debugging in case there are any problems (I directly diff the
sort results when I can expect them to be identical; tune.pl does this
automatically).
I would prefer if you used stringsample.cpp and int64.cpp as examples of
how to do this, and downloaded the boost sort library and added any
modifications you need into that code. The library has a tune.pl script
that runs the benchmarks. It may be helpful to write some templated
utility function to switch between different algorithms, instead of the
current approach of using a "-std" switch to use std::sort.
https://github.com/spreadsort/sort
There is a README there on how to install it using modular boost.
Thanks for your ideas. I will check them.
Now I am designing, developing and testing several improvements for the
algorithms. When finished, I will prepare a clear and easy interface to
transform any sort algorithm, even the parallel ones, into an indirect
sort, and prepare a brief document explaining how to do it.
Perhaps it will be a good moment to move all the code to the namespace
boost::sort.
Only after these things will I begin to prepare the big benchmark. And
then I would like your experience and opinion.
My idea is to make the benchmark run in an easy way; if we want to know
the opinion and results of other members of the Boost community, we must
try to simplify their work.
I will inform you of my progress, but I must travel to be with my family,
and until January I won't run at full speed.
Happy Christmas. My best wishes for you and the Boost community.
Francisco
This Christmas I have been working hard. I developed a new version of the
merge sort, about 10-15% faster than the previous one, using additional
memory of only ½ of the memory used by the data. The GCC and TBB
algorithms use additional memory equal to the size of the data.
This version is exception safe. If an exception occurs in the algorithm
(it only throws exceptions in the allocation and deallocation of the
additional memory), the integrity of the data is guaranteed: you still
have all the original data (unordered if the allocation throws, and
ordered if the deallocation throws).
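For context, the usual way to merge with only N/2 auxiliary memory is to sort both halves, move the lower half into a buffer, and merge the buffer with the upper half back into the original range. A generic sketch of that technique (my own illustration, not the actual countertree code):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Sort each half, move the lower half into an N/2 buffer, then merge the
// buffer with the upper half back into v. The write cursor can never
// overtake the unread part of the upper half, so the overlapping merge
// is safe; ties take the buffered (left) element first, keeping stability.
template <class T>
void half_memory_merge_sort(std::vector<T> &v)
{
    std::size_t n = v.size();
    if (n < 2) return;
    std::size_t mid = n / 2;
    std::stable_sort(v.begin(), v.begin() + mid);   // stand-ins for the
    std::stable_sort(v.begin() + mid, v.end());     // recursive calls
    std::vector<T> buf(std::make_move_iterator(v.begin()),
                       std::make_move_iterator(v.begin() + mid));
    auto b = buf.begin(), be = buf.end();
    auto s = v.begin() + mid;
    auto out = v.begin();
    while (b != be && s != v.end())
        *out++ = (*s < *b) ? std::move(*s++) : std::move(*b++);
    std::move(b, be, out);  // upper-half leftovers are already in place
}
```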
I examined the TBB version and the secret is the sample sort algorithm.
The TBB version is provisional and it's not exception safe. I am working
on the design of a new, exception-safe version of this algorithm. The
current version of the parallel stable algorithm is about 20% slower than
the TBB version in the worst case, but when the size of the elements
grows, the indirect version is faster than all the others: with 64 bytes
it is around 10% faster than the best TBB version, with 128 bytes the
best TBB version is around 90% slower, and with 256 bytes it is 240%
slower.
An additional advantage of the indirect sort is that it uses additional
memory for the pointers, not for the data, and that memory is around 10%
of the memory used by the data.
I hope to have the new algorithm in two weeks, more or less. Then I will
put everything in the namespace boost::sort, and design all the tests and
benchmarks, as discussed in the previous message.
Yours
Francisco
On Tue Jan 13 2015 at 12:43:00 PM Francisco José Tapia <fjt...@gmail.com>
wrote:
> Hi Steven,
>
> This Christmas I have been working hard. I developed a new version of the
> merge sort, about 10-15% faster than the previous one, using additional
> memory of only ½ of the memory used by the data. <snip>
>
> This version is exception safe. <snip>
>
Less memory, comparable speed, and exception safety sounds good.
> I examined the TBB version and the secret is the sample sort algorithm.
> <snip> When the size of the elements grows, the indirect version is
> faster than all the others.
>
Yes, indirect sorting is clearly superior for large data types, and it
would be good to make it as easy as possible to use. It'd be nice to also
have a simple wrapper that can use the spreadsort algorithm with
indirect sorting too.
I am in the process of integrating the first version of the boost sort
library in https://github.com/boostorg/sort. It would be good to integrate
your testing with the tests in that library.
<snip>
> > Yes, indirect sorting is clearly superior for large data types, and it
> > would be good to make it as easy as possible to use. It'd be nice to
> > also have a simple wrapper that can use the spreadsort algorithm with
> > indirect sorting too.
Would Boost.Multiprecision (say, a 50-decimal-digit floating-point type) provide a simple demo type?
> I am in the process of integrating the first version of the boost sort library in
> https://github.com/boostorg/sort. It would be good to integrate your testing with
> the tests in that library.
Perhaps a git branch off the new /sort/develop branch would be useful?
I note that Steven Ross is making a separate namespace for spreadsort
boost::sort::spreadsort
Although this seems long (slow typists can provide an alias or a using ... ;-)
I strongly suggest that you should follow this example, perhaps
boost::sort::tbb
though I don't like tbb at all - far too acronymic? Think of a better name, even if longer?
boost::sort::parallel?
Paul
PS The new Sort/Spreadsort docs are almost ready for release.
I will add Quickbook documentation (from plain text notes) for your parallel sort version if you write to me off-list.
---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830
I have the final version of all the algorithms. I wrote 6 stable sort
algorithms, and selected the fastest. On my computer it is about 15%
faster than the GCC stable sort.
I also had problems with a bug in one algorithm combined with an error in
the GCC dynamic memory system: under a certain sequence of requests, the
system provided an invalid address, and due to this, a segmentation
fault. The debugging to detect and correct it was a nightmare.
The algorithms provided are:
1.- sort : the algorithm is intro_sort. Doesn't need additional memory.
Performance similar to the GCC and Windows versions. Non-stable.
2.- parallel_sort : decomposes the global range into small ranges to be
sorted with intro_sort. Doesn't need additional memory. Performance
similar to the TBB and Microsoft versions. Non-stable.
3.- stable_sort : the algorithm is called smart_merge_sort, and is about
10-15% faster than the GCC stable sort. The additional memory is half of
the memory used by the data (the GCC stable sort uses the same amount of
memory as the data).
4.- sample_sort : a parallel stable sort. Has the same performance as the
TBB implementations, and is much faster than the GCC parallel
stable_sort. The additional memory is the same as the data.
5.- parallel_stable_sort : based on the sample sort. Increases the
sample_sort time by about 5%, but the additional memory used is half of
the memory used by the data.
The functions are in the namespace boost::sort. All the algorithms are
exception safe, and use indirect sort when the size of the objects is
big. According to my benchmarks, the size limit for switching to an
indirect sort is 128 bytes.
The function format of the parallel algorithms is
algorithm ( first, last, compare, number_threads ).
The number of threads is passed with a struct NThreads, which receives an
unsigned and returns an unsigned. The default constructor of this struct
is filled with the number of HW threads.
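Roughly, the thread parameter looks like this (a sketch reconstructed from the description; the real code may differ in details):

```cpp
#include <thread>

// Hypothetical reconstruction of the NThreads struct described above:
// constructed from an unsigned, returns an unsigned, and defaults to the
// number of hardware threads reported by the system.
struct NThreads
{
    unsigned num;
    NThreads() : num(std::thread::hardware_concurrency()) {}
    explicit NThreads(unsigned n) : num(n) {}
    unsigned operator()() const { return num; }
};

// A call would then look like:
//   boost::sort::parallel_sort(first, last, comp, NThreads(4));
```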
At the moment I am testing on Windows with Visual Studio 2013 CTP 4, and
changing some things in the code due to incompatibilities with C++11
features.
In a few days I will document all the functions of the code to generate
Doxygen documentation, and can send you a version of the code.
Then we must talk about how to design the benchmark. I think the starting
point can be your experience with your code.
The C++ committee proposes a format for calling the parallel functions,
described in
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
It would be a good idea to examine it, and decide whether we follow the
proposal, don't follow it, or provide both interfaces for the parallel
algorithms. From my first impression, it's not very complex to implement.
Yours
Francisco Tapia
Could you send them, please? I'd like to verify the speedup you're seeing,
and compare vs TBB. I'm OK if it is just comparable in speed to TBB and
has additional advantages of some type.
>
> I had problems too with an error in an algorithm mixed with an error in
> the GCC dynamic memory system. <snip>
>
One thing to watch out for is undefined operations, such as type
conversions between int and float types; those aren't compiler bugs:
they're undefined (this problem hit me with float_sort, before someone
showed me how to convert safely). I hope you've fixed the bug.
> The algorithms provided are:
> 1.- sort : intro_sort, performance similar to the GCC and Windows
> versions. 2.- parallel_sort : performance similar to the TBB and
> Microsoft versions.
> <snip>
>
Does it provide any advantage over those implementations, aside from being
open-source?
>
> 3.- stable_sort : smart_merge_sort, about 10-15% faster than the GCC
> stable sort. The additional memory is half of the memory used by the
> data (the GCC stable sort uses the same amount of memory as the data).
>
GCC uses no additional memory, or 100% additional memory?
Did you compare against the tbb stable sort I pointed you to?
>
> 4.- sample_sort : a parallel stable sort with the same performance as
> the TBB implementations; the additional memory is the same as the data.
> 5.- parallel_stable_sort : based on the sample sort; about 5% slower,
> but the additional memory used is half of the memory used by the data.
>
That's interesting, so it's better from a memory perspective than the
competing algorithms, and not bad in speed? That sounds like it's worth
serious consideration as an addition.
> The functions are in the namespace boost::sort. All the algorithms are
> exception safe, and use indirect sort when the size of the objects is
> big. <snip>
>
> The function format of the parallel algorithms is
> algorithm ( first, last, compare, number_threads ). <snip>
>
> <snip>
>
> Then we must talk about how to design the benchmark. I think the
> starting point can be your experience with your code.
>
Please look at this code, and refactor/reuse what test code you can:
https://github.com/boostorg/sort
Specifically, look in the test directory, the example directory, and at
tune.pl. Are there any parameters you would want to tune for specific
systems (besides the number of threads)?
The basic idea is that randomgen writes out a randomized distribution to
disk, and then we run whatever algorithms we want to compare (writing their
results to disk), verify that the results are correct, then compare speed.
I'd prefer not to skip the disk intermediate step because having the files
on disk makes it easier to debug when something gives incorrect results.
We should compare with variable-length strings and ints, in addition to
more complicated data structures like what you've been testing with.
It's worth noting that it's quite common to use vectors for large
quantities of data in a class/struct, in which case the sorting would
already be indirect.
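To illustrate that point (my own minimal example, not code from the thread): when the bulk of the data lives in a vector member, sorting moves only the small object headers, so the heavy payload never moves.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A record whose bulk lives on the heap: moving or swapping a record
// exchanges only the key and the vector's internal pointers, so sorting a
// vector<record> is effectively already an indirect sort of the payloads.
struct record
{
    std::uint64_t key;
    std::vector<std::uint8_t> payload;
    bool operator<(const record &r) const { return key < r.key; }
};
```

Sorting such records with std::sort costs roughly the same per swap regardless of how large each payload is.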
>
> The C++ committee proposes a format for calling the parallel functions,
> described in
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf
> <snip>
>
With your implementations, it looks like people have these choices:
1) how many threads to use
2) stable or not
3) memory or speed efficient
4) direct or indirect
Note that #4 may be influenced by #3, and that you may be able to determine
that automatically, but providing people a simple way to make those choices
is a good idea. You could use the approach suggested by the committee if
it fits well, or another one if you have a better approach to this specific
problem.
Steve
In the file at this link, in my Dropbox space, you have the code, the
test programs, and the benchmarks with GCC:
https://dl.dropboxusercontent.com/u/8437476/works/sort/Sort_V0.1.7z
*FILE DESCRIPTION*
In the root folder of the archive you have an INSTRUCTIONS.txt file
detailing the contents of all the folders, information on how to compile
the programs, where each file is, and which options and libraries you
must link.
*Test*
Contains the test programs for each class and algorithm.
*Src*
Contains all the source code. In the folder *src/boost/sort* there is a
file *algorithm.hpp*, which is the only include file needed to invoke the
algorithms.
The parallel algorithms have a parameter of the NThread type which
indicates the number of threads to use. By default it uses the same
number as the HW threads of the machine. It's a very simple structure
defined in boost/sort/util/atomic.hpp.
*Benchmarks*
This folder contains the benchmark code for the GCC compiler under Linux x64.
In the future I will add benchmarks with other compilers and operating
systems.
The folders *cpp-TimSort-master* and *parallel_stable_sort* contain the C++
code of the TimSort algorithm, found in
https://travis-ci.org/gfx/cpp-TimSort, and the code of the sample sort with
TBB.
That code had to be modified because the four versions use the same namespace
and many common names, so it was not possible to invoke the four versions in
the same program. The modification puts each version in a different
namespace.
In the folder *GCC*, you can find 3 folders:
- *algorithm*: contains the code of the benchmark programs.
- *bin*: contains the compiled, statically linked versions of the benchmark
programs. This folder also has a shell script, run.sh, which runs the
programs sequentially and collects the results in .txt files with the same
name as each program.
- *Results*: here you can find the results generated by the benchmark
programs on different machines. Each machine has a folder containing the
text files with the results (with the same names as the programs), and a
description.txt file describing the HW where the benchmarks ran. The format
used in this folder is the one suggested for presenting the results of all
the machines used in the test. Here you have the results from my two
computers, which give a first idea of the results to expect on other
machines.
The benchmarks, as in the old version, include a version with 64-bit numbers
and a version with objects of several sizes.
About the memory used by the algorithms: if M is the memory used by the
data, the total memory used by each algorithm is:
- GCC sort → M
- boost sort → M
- GCC parallel_sort → M
- boost parallel_sort → M
- TBB parallel_sort → M
- GCC stable_sort → 2 M
- boost stable_sort → 1.5 M
- TimSort → 2 M
- GCC parallel_stable_sort → 2.5 M
- TBB parallel_sort → 2 M
- boost sample_sort → 2 M
- boost parallel_stable_sort → 1.5 M
The algorithms use the indirect version when the size of the objects is
greater than 128 bytes.
I am speaking with several friends about running the benchmarks on more
powerful machines.
I think the first thing is to check the speed. If the speed is good, we have
something useful; if not, we must decide what to do with this and whether
any part is worth keeping, or whether we must drop it all.
If we decide to continue, the next steps are:
- Reorganize the benchmarks, and include benchmarks with string objects.
Your idea about using files is, I think, the correct way to go.
- Finish all the code documentation so that Doxygen documentation can be
generated.
- Generate the documentation web pages for the Boost project.
- Modify the test programs to work with the bjam files.
If you find any problem, error, or doubt, please contact me to resolve it.
Francisco
Thanks for your detailed explanation.
I'll repeat what I've said before about your object benchmark:
I suggest changing the struct you use for benchmarking to this:
template <uint32_t NN>
struct reburcio
{
    uint64_t M[NN];

    reburcio() {
        init();
    }

    void init() {
        for (uint32_t i = 0; i < NN; ++i) {
            // or something with less variability to test
            // identical-substring performance
            M[i] = my_rand();
        }
    }

    bool operator < (const reburcio &R) const {
        for (uint32_t i = 0; i < NN; ++i) {
            if (M[i] != R.M[i]) {
                return M[i] < R.M[i];
            }
        }
        return false;
    }
};
That way it's using the default copy and assignment operators, which copy
over the entire data structure, and it initializes and compares all elements
of the array. It appears that the benchmark you sent me only copies over
and compares the first element in the array, which might have caused some
of the large-element differences in performance you were seeing.
Another alternative is just using a vector for the array of elements.
On Tue, Mar 24, 2015 at 4:56 PM Francisco José Tapia <fjt...@gmail.com>
wrote:
>
> - GCC sort → M
> - boost sort → M
>
I don't think we'll ship this; this looks more like a proof-of-concept,
unless the indirect sorting benefits are demonstrable after fixing the
object you use in testing as described above.
>
> - GCC parallel_sort → M
> - boost parallel_sort → M
>
I see no advantage to this sort relative to tbb, and its performance on
already-sorted data needs to improve to be competitive with tbb.
> - TBB parallel_sort → M
>
> - GCC stable_sort → 2 M
> - boost stable_sort → 1.5 M
>
The combination of improved speed and lower memory usage looks impressive.
I'll try to verify this, but this looks promising.
> - TimSort → 2 M
> - GCC parallel_stable_sort → 2.5 M
> - TBB parallel_sort → 2 M
>
> - boost sample_sort → 2 M
> - boost parallel_stable_sort → 1.5 M
>
Both of these algorithms look good. I'm inclined towards
parallel_stable_sort because memory is becoming more of a bottleneck on
modern systems, but I'll have to do some testing.
Timsort seems fairly impressive, but specialized and already available for
free, hence not necessary to include in the library.
>
> If you find any problem, error, or doubt, please contact me to resolve it.
>
Please fix your object benchmarks as described at the beginning of my
email; otherwise I don't trust them. I also would like to see string
sorting speeds. Once that is all together, I'd like to test and review
them carefully and look to see if there are any weaknesses missed by your
current tests. If they look good, and we can agree on which algorithms you
want to add to the library, we should move on to cleaning this up for
review.
Steve
On 17 December 2014 at 05:42, Francisco José Tapia <fjt...@gmail.com>
wrote:
>
> SCALABILITY: Some algorithms don't scale well when the number of threads
> grows; increasing the number of threads doesn't decrease the execution
> time. This lack of scalability is emphasized when the size of the elements
> to sort grows.
Sorry about butting in so late, but on the topic of scalability and
potential superiority to TBB, you might be interested to check out Hartmut
Kaiser's talk on asynchronous computing (
https://www.youtube.com/watch?v=5xyztU__yys) and his lab's library, HPX (
http://stellar.cct.lsu.edu/projects/hpx/).
Cheers.
Jeremy
On Wed, Mar 25, 2015 at 3:57 PM Francisco José Tapia <fjt...@gmail.com>
wrote:
> TEST OBJECTS
>
> The object used in the test (Reburcio is a word from an exercise in my
> first Pascal book, and doesn't mean anything). In the constructor, copy
> constructor, and operator=, all the elements are copied with a loop. The
> only operation which uses only the first element is the comparison.
>
> I have done a new comparison which compares the sum of all the elements of
> the object. The times are similar; you must add from 5% for the small
> objects to 15% for the big objects.
>
I'm not precisely sure of your meaning, but I suggested modification of the
test object to use default operators where possible, to randomize the
entire contents of the object (so that not all integers in the array are
the same), and to use the entire contents of the object for comparison. I
believe that these suggested changes make it more reflective of the
conventional large-object comparison case. Most of the time, only the
first integer will be compared, but in equal cases the entire array will be
compared. I don't like feeding overly simplified objects into benchmarks,
because compilers can make some surprising optimizations.
> TIMSORT
>
> The Tim Sort code is not mine.
>
Agreed. I think we can drop it from this comparison, as your algorithm is
clearly much faster. I mentioned it because it is used with other
languages, and responded when somebody else on the email group suggested I
allow it. I'm not asking you for an implementation.
Timsort uses N/2 additional memory.
> DESIGN PHASE
>
> Now we are in the design phase. Nothing is fixed; we can comment on and
> change anything.
>
> The specification is simple: provide sort algorithms, stable and unstable,
> that run with 1 thread or with any number of threads, without any other
> external library (such as OpenMP, TBB, CilkPlus...). Only the C++
> standard.
>
> If you want to discuss this, we can. Ideas, suggestions, and opinions are
> welcome; any help is welcome. The final decision about whether the code is
> included in the Boost sort library depends on Steven, the creator of the
> library.
>
> After the design phase, with the decisions taken and the interfaces fixed,
> we will prepare the final version, redesign the test system, test in
> depth, and create the documentation.
>
>
I suggest deciding what you want to try to include. Getting a parallel
unstable sort competitive with tbb will be hard, and I'm skeptical of
including a library that just replicates what tbb already provides.
> I have tried to improve. I would like to have more time, more equipment,
> and better documentation in order to find the best options, but I don't
> have them, and I do this in my free time after my work and my family.
>
I'm in the same situation with regards to available time. I think many
boost participants are.
>
> I designed 6 stable sort algorithms in order to select the fastest
> (internally named smart_merge_sort, about 10% faster than the GCC
> version). I prepared a version of Sample Sort using only threads and
> atomic variables. In the benchmarks it has performance similar to the TBB
> version, uses one half of the auxiliary memory used by TBB, and is
> exception safe.
>
I like your parallel stable sorts. They look like the most promising part
of your proposed library additions. Indirect sorting is also interesting.
I just want reliable large-object and string comparison numbers for them
before diving in further.
Have you looked at integrating your testing with the approach in the Boost
Sort library? The randomgen binary generates files with different types of
random number distributions that can test a variety of scenarios, and the
tune.pl script controls automated performance testing.
>
> In my opinion, the parallel stable sort is good enough to be included in
> the final version. The important case, which shows the quality of the
> algorithm, is when the data are unsorted. I will try to implement the
> detection of sorted elements, as in TBB, but I am not sure about the
> results. The internal loops of the algorithm are delicate, and a small
> change can produce big changes in performance, as happened in other
> algorithms.
>
Is there a way you can do a fast initial pass, checking for already-sorted
data? That's what Spreadsort does (at each level of the sort).
> Hi Steven
>
> Thanks for your message.
>
> Until now I had been focused mainly on the algorithms, and I didn't
> dedicate much time to other things such as the test format and the
> integration with boost::sort. Now, the first things to do are:
>
> - Try to detect when the data are ordered,
>
Just verify that each successive element is not less than the previous. If
inputs are unique (or completely identical if they compare the same), then
you can compare results with those of std::sort.
> - Revise your library and code in depth, in order to adapt the tests and
> benchmarks to the procedures used in your library.
>
> About the object used in the benchmarks, we can do something simple.
> Reburcio is a very small class in the file
> Benchmarks/GCC/algorithm/reburcio.hpp. Please rewrite it and send it to
> me; I only need to recompile the benchmarks.
>
I suggest renaming the variables to be clearer, but here's the idea:
template <uint32_t NN>
struct intarray
{
    uint32_t M[NN];

    bool operator < (const intarray &R) const {
        for (uint32_t i = 0; i < NN; ++i) {
            if (M[i] != R.M[i]) {
                return M[i] < R.M[i];
            }
        }
        return false;
    }
};
Use <BOOST_ROOT>/libs/sort/example/randomgen.cpp (b2 libs/sort/randomgen
--tune will build the binary in the libs/sort dir) to generate randomized
input files, and memcpy them into arrays of intarray to fill them with the
data to sort. This will emulate the large non-vector objects the indirect
sorting approach you've designed is optimized for, and will fill the entire
object with randomized contents, along with having a full (default) copy
constructor.
You can also use strings in place of intarray for testing string sorting. I
randomize them by just using the >> operator on strings to read them in; it
automatically breaks on whitespace, so the strings end up with randomized
lengths.
>
> *About the parts to include*
>
> With the parallel unstable sort, all the algorithms I examined have the
> same approach, and for that reason similar performance. The range of
> decision is small.
>
> The goal is to provide algorithms that are independent of any library or
> operating system: fast, robust, and easy to use. Doing the same thing as a
> company is the origin of much free SW; think of Oracle and MySQL, or
> Internet Explorer and Firefox. Even the C++ standard does what many
> companies have been doing for many years, each in its own way. The final
> users are grateful because it simplifies their work.
>
>
> TBB is available on Windows, Linux, and OS X. With other operating systems
> (for example Android and all the real-time OSes), you must recompile all
> the source code, and I am not clear about the dynamic linking of TBB on
> those operating systems. Many small machines don't have a task scheduler,
> but do have threads. Forcing a recompile of TBB just to use a parallel
> sort is like renting a bus to transport only one person.
>
Who wants to do a parallel sort on Android? The OS often only gives you
one core (depending on priorities), and it would burn the battery at
ridiculous speed. I just don't see the use case, unless you can provide
some advantage in some fairly common use case, and it would have to fully
match tbb's features to within a few percent, including efficiency on
already-sorted input.
> Our code has similar performance, is small, and is independent of any
> operating system and of any other code or library. If you are C++11
> compliant, you have parallel sort. I think we must include sort, parallel
> sort, stable sort, and parallel stable sort. Perhaps also sample sort, but
> it's less important.
>
Your stable sort and parallel stable sort look promising, and sample sort
is a possibility. I recommend concentrating on them for now.