
Task for Amine


Bonita Montero

Nov 16, 2019, 1:36:51 AM
Here, Amine, here are a generic parallel merge-sort as well as a parallel
quicksort algorithm. The merge-sort algorithm is faster than the
Microsoft STL's stable_sort. But the quicksort is slower than the MS
STL's implementation. So improve the latter to make it faster than the
MS implementation! I don't know how to make it faster.

#include <algorithm>
#include <iterator>
#include <cstdlib>
#include <thread>
#include <functional>
#include <exception>
#include <mutex>
#include <memory>
#include <execution>
#include <array>

template<typename T>
struct invoke_on_destruct
{
private:
T &m_t;
bool m_enabled;

public:
invoke_on_destruct( T &t ) :
m_t( t ), m_enabled( true )
{
}

~invoke_on_destruct()
{
if( m_enabled )
m_t();
}

void invoke_and_disable()
{
m_t();
m_enabled = false;
}
};

template<typename It,
typename Cmp = std::less<typename
std::iterator_traits<It>::value_type>,
typename Allocator = std::allocator<typename
std::iterator_traits<It>::value_type>>
class merge_sort
{
private:
using ep_allocator = typename
std::allocator_traits<Allocator>::template rebind_alloc<std::exception_ptr>;
using exceptions_vector = std::vector<std::exception_ptr,
ep_allocator>;

public:
merge_sort( It start, It end, Cmp const &cmp = Cmp(), unsigned
nThreads = 0, Allocator const &alloc = Allocator() );

struct par_exception : public std::exception
{
using iterator = typename std::vector<std::exception_ptr,
ep_allocator>::iterator;
// there's no copy-constructor because of internal vector
// so catch par_exception only via reference
~par_exception();
iterator begin();
iterator end();
private:
friend class merge_sort;
par_exception( std::vector<std::exception_ptr, ep_allocator>
&&exceptions );
std::vector<std::exception_ptr, ep_allocator> m_exceptions;
};

private:
Cmp m_cmp;
Allocator m_alloc;
// mutex that protects the exception-array
std::mutex m_excMtx;
exceptions_vector m_exceptions;
template<typename UpIt>
void parRecursion( UpIt start, UpIt end, unsigned nThreads );
template<typename UpIt, typename BufIt>
void recursion( UpIt start, UpIt end, BufIt buf );
template<typename UpIt, typename BufIt>
void merge( UpIt start, BufIt leftBuf, BufIt rightBuf, BufIt bufEnd );
};

template<typename It, typename Cmp, typename Allocator>
merge_sort<It, Cmp, Allocator>::merge_sort( It start, It end, Cmp const
&cmp, unsigned nThreads, Allocator const &alloc ) :
m_cmp( cmp ),
m_alloc( alloc ),
m_exceptions( ep_allocator( alloc ) )
{
using namespace std;
// threads == 0 -> number of threads = hardware-threads
unsigned hwThreads = thread::hardware_concurrency();
hwThreads += hwThreads == 0;
nThreads = nThreads == 0 ? hwThreads : nThreads <= hwThreads ?
nThreads : hwThreads;
// reserve number of threads elements in the exception_ptr-vector
// so that there will be no exception when we do emplace_back
m_exceptions.reserve( nThreads );
try
{
parRecursion( start, end, nThreads );
if( m_exceptions.size() )
if( m_exceptions.size() > 1 )
// multiple exceptions from threads: throw par_exception
throw par_exception( move( m_exceptions ) );
else
// only one exception from threads: rethrow it
rethrow_exception( m_exceptions[0] );
}
catch( ... )
{
if( m_exceptions.size() )
{
// additional exception caught: throw par_exception
m_exceptions.emplace_back( current_exception() );
throw par_exception( move( m_exceptions ) );
}
else
// single exception: rethrow it
throw;
}
}

template<typename It, typename Cmp, typename Allocator>
template<typename UpIt, typename BufIt>
void merge_sort<It, Cmp, Allocator>::recursion( UpIt start, UpIt end,
BufIt buf )
{
using namespace std;
if( end - start <= 1 )
return;
copy( start, end, buf );
size_t n = end - start;
BufIt leftBuf = buf,
leftBufEnd = buf + n / 2,
rightBuf = leftBufEnd,
bufEnd = buf + n;
recursion( leftBuf, leftBufEnd, bufEnd );
recursion( rightBuf, bufEnd, bufEnd );
merge( start, leftBuf, rightBuf, bufEnd );
}

template<typename It, typename Cmp, typename Allocator>
template<typename UpIt, typename BufIt>
void merge_sort<It, Cmp, Allocator>::merge( UpIt start, BufIt leftBuf,
BufIt rightBuf, BufIt bufEnd )
{
BufIt leftBufEnd = rightBuf;
for( UpIt wrtBack = start; ; )
if( m_cmp( *leftBuf, *rightBuf ) )
{
*wrtBack++ = *leftBuf;
if( ++leftBuf == leftBufEnd )
{
// faster for small number of elements than std::copy
do
*wrtBack++ = *rightBuf;
while( ++rightBuf != bufEnd );
break;
}
}
else
{
*wrtBack++ = *rightBuf;
if( ++rightBuf == bufEnd )
{
do
*wrtBack++ = *leftBuf;
while( ++leftBuf != leftBufEnd );
break;
}
}
}

template<typename It, typename Cmp, typename Allocator>
template<typename UpIt>
void merge_sort<It, Cmp, Allocator>::parRecursion( UpIt start, UpIt end,
unsigned nThreads )
{
using namespace std;
using T = typename iterator_traits<It>::value_type;
size_t n = end - start;
if( nThreads <= 1 )
{
vector<T, Allocator> buf( m_alloc );
size_t bs = 0;
// calculate the total scratch-buffer size over all recursion levels
for( size_t split = end - start; split > 1; bs += split, split -= split / 2 );
buf.resize( bs );
recursion( start, end, buf.begin() );
}
else
{
// split-buffer
vector<T, Allocator> buf( m_alloc );
buf.resize( n );
copy( start, end, buf.begin() );
// array for left-recursion and right-recursion thread
array<thread, 2> threads;
// automatically join threads when an exception is thrown
auto joinThreads = [&threads]()
{
for( thread &thr : threads )
// try to join infinitely because the thread
// will continue to access our buffer
if( thr.get_id() != thread::id() )
for( ; ; )
try
{
thr.join();
break;
}
catch( ... )
{
}
};
invoke_on_destruct<decltype(joinThreads)> iodJoin( joinThreads );
// iterator-type for our split-buffer
using BufIt = typename vector<T, Allocator>::iterator;
// proxy thread-lambda for our new threads
auto prProxy = [this]( BufIt start, BufIt end, unsigned nThreads )
{
try
{
parRecursion( start, end, nThreads );
}
catch( ... )
{
// remember exception in our exception-array
unique_lock<mutex> excLock( m_excMtx );
m_exceptions.emplace_back( current_exception() );
}
};
unsigned rightThreads = nThreads / 2,
leftThreads = nThreads - rightThreads;
// the left side gets the extra thread when nThreads is odd,
// so give it a proportionally larger share of the input
size_t left = (size_t)(n * ((double)leftThreads / nThreads)),
right = n - left;
BufIt leftBuf = buf.begin(),
leftBufEnd = buf.begin() + left,
rightBuf = leftBufEnd,
bufEnd = buf.begin() + n;
// start left thread
threads[0] = thread( prProxy, leftBuf, leftBufEnd, leftThreads );
if( rightThreads > 1 )
// start right thread
threads[1] = thread( prProxy, rightBuf, bufEnd, rightThreads );
else
// there's only one thread right, so we do it on our own
parRecursion( rightBuf, bufEnd, 1 );
// join threads
iodJoin.invoke_and_disable();
// stop if there are any exceptions from the threads
unique_lock<mutex> excLock( m_excMtx );
if( m_exceptions.size() )
return;
excLock.unlock();
// merge split-buffer back into input-buffer
merge( start, leftBuf, rightBuf, bufEnd );
}
}

template<typename It, typename Cmp, typename Allocator>
merge_sort<It, Cmp, Allocator>::par_exception::~par_exception()
{
}

template<typename It, typename Cmp, typename Allocator>
merge_sort<It, Cmp, Allocator>::par_exception::par_exception( typename
merge_sort<It, Cmp, Allocator>::exceptions_vector &&exceptions ) :
m_exceptions( std::move( exceptions ) )
{
}

template<typename It, typename Cmp, typename Allocator>
typename merge_sort<It, Cmp, Allocator>::par_exception::iterator
merge_sort<It, Cmp, Allocator>::par_exception::begin()
{
return m_exceptions.begin();
}

template<typename It, typename Cmp, typename Allocator>
typename merge_sort<It, Cmp, Allocator>::par_exception::iterator
merge_sort<It, Cmp, Allocator>::par_exception::end()
{
return m_exceptions.end();
}

#if defined(_MSC_VER)
#pragma warning(disable: 26444)
#endif

template<typename It,
typename Cmp = std::less<typename
std::iterator_traits<It>::value_type>,
typename Allocator = std::allocator<typename
std::iterator_traits<It>::value_type>>
class quick_sort
{
private:
using ep_allocator = typename
std::allocator_traits<Allocator>::template rebind_alloc<std::exception_ptr>;
using exceptions_vector = std::vector<std::exception_ptr,
ep_allocator>;

public:
quick_sort( It start, It end, Cmp const &cmp = Cmp(), unsigned
nThreads = 0, Allocator const &alloc = Allocator() );

struct par_exception : public std::exception
{
using iterator = typename std::vector<std::exception_ptr,
ep_allocator>::iterator;
// there's no copy-constructor because of internal vector
// so catch par_exception only via reference
~par_exception();
iterator begin();
iterator end();
private:
friend class quick_sort;
par_exception( std::vector<std::exception_ptr, ep_allocator>
&&exceptions );
std::vector<std::exception_ptr, ep_allocator> m_exceptions;
};

private:
Cmp m_cmp;
Allocator m_alloc;
// mutex that protects the exception-array
std::mutex m_excMtx;
exceptions_vector m_exceptions;
void parRecursion( It left, It right, unsigned nThreads );
void recursion( It left, It right );
It partition( It left, It right );
};

template<typename It, typename Cmp, typename Allocator>
quick_sort<It, Cmp, Allocator>::quick_sort( It start, It end, Cmp const
&cmp, unsigned nThreads, Allocator const &alloc ) :
m_cmp( cmp ),
m_alloc( alloc ),
m_exceptions( ep_allocator( alloc ) )
{
using namespace std;
// threads == 0 -> number of threads = hardware-threads
unsigned hwThreads = thread::hardware_concurrency();
hwThreads += hwThreads == 0;
nThreads = nThreads == 0 ? hwThreads : nThreads <= hwThreads ?
nThreads : hwThreads;
// reserve number of threads elements in the exception_ptr-vector
// so that there will be no exception when we do emplace_back
m_exceptions.reserve( nThreads );
try
{
// guard: with an empty range, end - 1 would precede start
if( start != end )
parRecursion( start, end - 1, nThreads );
if( m_exceptions.size() )
if( m_exceptions.size() > 1 )
// multiple exceptions from threads: throw par_exception
throw par_exception( move( m_exceptions ) );
else
// only one exception from threads: rethrow it
rethrow_exception( m_exceptions[0] );
}
catch( ... )
{
if( m_exceptions.size() )
{
// additional exception caught: throw par_exception
m_exceptions.emplace_back( current_exception() );
throw par_exception( move( m_exceptions ) );
}
else
// single exception: rethrow it
throw;
}
}

template<typename It, typename Cmp, typename Allocator>
void quick_sort<It, Cmp, Allocator>::parRecursion( It left, It right,
unsigned nThreads )
{
using namespace std;
if( nThreads == 1 )
recursion( left, right );
else
{
if( left >= right )
return;
It p = partition( left, right );
// array for left-recursion and right-recursion thread
array<thread, 2> threads;
// automatically join threads when an exception is thrown
auto joinThreads = [&threads]()
{
for( thread &thr : threads )
// try to join infinitely because the thread
// will continue to access our buffer
if( thr.get_id() != thread::id() )
for( ; ; )
try
{
thr.join();
break;
}
catch( ... )
{
}
};
invoke_on_destruct<decltype(joinThreads)> iodJoin( joinThreads );
// proxy thread-lambda for our new threads
auto prProxy = [this]( It left, It right, unsigned nThreads )
{
try
{
parRecursion( left, right, nThreads );
}
catch( ... )
{
// remember exception in our exception-array
unique_lock<mutex> excLock( m_excMtx );
m_exceptions.emplace_back( current_exception() );
}
};
unsigned rightThreads = nThreads / 2,
leftThreads = nThreads - rightThreads;
threads[0] = thread( prProxy, left, p, leftThreads );
if( rightThreads > 1 )
// start right thread
threads[1] = thread( prProxy, p + 1, right, rightThreads );
else
// there's only one thread right, so we do it on our own
recursion( p + 1, right );
}
}

template<typename It, typename Cmp, typename Allocator>
void quick_sort<It, Cmp, Allocator>::recursion( It left, It right )
{
if( left >= right )
return;
It p = partition( left, right );
recursion( left, p );
recursion( p + 1, right );
}

template<typename It, typename Cmp, typename Allocator>
It quick_sort<It, Cmp, Allocator>::partition( It left, It right )
{
using namespace std;
using T = typename iterator_traits<It>::value_type;
T pivot = left[(right - left) / 2];
It i = left - 1,
j = right + 1;
for( ; ; )
{
while( m_cmp( *++i, pivot ) );
while( m_cmp( pivot, *--j ) );
if( i >= j )
return j;
swap( *i, *j );
}
}

template<typename It, typename Cmp, typename Allocator>
quick_sort<It, Cmp, Allocator>::par_exception::~par_exception()
{
}

template<typename It, typename Cmp, typename Allocator>
quick_sort<It, Cmp, Allocator>::par_exception::par_exception( typename
quick_sort<It, Cmp, Allocator>::exceptions_vector &&exceptions ) :
m_exceptions( std::move( exceptions ) )
{
}

template<typename It, typename Cmp, typename Allocator>
typename quick_sort<It, Cmp, Allocator>::par_exception::iterator
quick_sort<It, Cmp, Allocator>::par_exception::begin()
{
return m_exceptions.begin();
}

template<typename It, typename Cmp, typename Allocator>
typename quick_sort<It, Cmp, Allocator>::par_exception::iterator
quick_sort<It, Cmp, Allocator>::par_exception::end()
{
return m_exceptions.end();
}

#include <iostream>
#include <random>
#include <chrono>
#include <limits>
#include <cassert>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
if( argc < 2 )
return EXIT_FAILURE;
size_t n = (size_t)(unsigned)atoi( argv[1]) * 1'000'000;
unsigned nThreads = argc < 3 ? 0 : (unsigned)atoi( argv[2] );
random_device rd;
uniform_int_distribution<int> uid( numeric_limits<int>::min(),
numeric_limits<int>::max() );
vector<int> rvRef;
rvRef.resize( n );
cout << "randomizing ..." << endl;
for( int &r : rvRef )
r = uid( rd );
vector<int> rvSort( rvRef );
cout << "sorting with merge_sort (multi-threaded) ..." << endl;
time_point<high_resolution_clock> start = high_resolution_clock::now();
using msInt = merge_sort<vector<int>::iterator>;
msInt( rvSort.begin(), rvSort.end(), less<int>(), nThreads );
double seconds = (double)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / 1.0E9;
cout << "sorting-time: " << seconds << " seconds" << endl;
for( vector<int>::iterator scn = rvSort.begin(); scn < rvSort.end()
- 1; ++scn )
assert(scn[0] <= scn[1]);
cout << "sorting with merge_sort (single-threaded) ..." << endl;
rvSort = rvRef;
start = high_resolution_clock::now();
msInt( rvSort.begin(), rvSort.end(), less<int>(), 1 );
seconds = (double)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / 1.0E9;
cout << "sorting-time: " << seconds << " seconds" << endl;
cout << "sorting with std::stable_sort (multi-threaded) ..." << endl;
rvSort = rvRef;
start = high_resolution_clock::now();
stable_sort( execution::parallel_policy(), rvSort.begin(),
rvSort.end(), less<int>() );
seconds = (double)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / 1.0E9;
cout << "sorting-time: " << seconds << " seconds" << endl;
cout << "sorting with std::sort (single-threaded) ..." << endl;
rvSort = rvRef;
start = high_resolution_clock::now();
sort( rvSort.begin(), rvSort.end(), less<int>() );
seconds = (double)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / 1.0E9;
cout << "sorting-time: " << seconds << " seconds" << endl;
cout << "sorting with std::sort (multi-threaded) ..." << endl;
rvSort = rvRef;
start = high_resolution_clock::now();
sort( execution::parallel_policy(), rvSort.begin(), rvSort.end(),
less<int>() );
seconds = (double)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / 1.0E9;
cout << "sorting-time: " << seconds << " seconds" << endl;
cout << "sorting with quick_sort (multi-threaded) ..." << endl;
rvSort = rvRef;
start = high_resolution_clock::now();
using qsInt = quick_sort<vector<int>::iterator>;
qsInt( rvSort.begin(), rvSort.end(), less<int>(), nThreads );
seconds = (double)duration_cast<nanoseconds>(
high_resolution_clock::now() - start ).count() / 1.0E9;
cout << "sorting-time: " << seconds << " seconds" << endl;
for( vector<int>::iterator scn = rvSort.begin(); scn < rvSort.end()
- 1; ++scn )
assert(scn[0] <= scn[1]);
return EXIT_SUCCESS;
}

Bonita Montero

Nov 16, 2019, 2:35:55 AM
> template<typename It, typename Cmp, typename Allocator>
> It quick_sort<It, Cmp, Allocator>::partition( It left, It right )
> {
>     using namespace std;
>     using T = typename iterator_traits<It>::value_type;
>     T  pivot = left[(right - left) / 2];
>     It i     = left  - 1,
>        j     = right + 1;
>     for( ; ; )
>     {
>         while( m_cmp( *++i, pivot ) );
>         while( m_cmp( pivot, *--j ) );

Hm, i or j might repeatedly stop at left + (right - left) / 2.
So I should copy the outer loop without one of the inner loops.

Bonita Montero

Nov 16, 2019, 2:56:11 AM
> Hm, i or j  might repeatedly stop at left + (right - left) / 2.
> So I should copy the outer loop without one of the inner loops.

Sorry, nonsense, the mid could be swapped away so that
either i or j could get beyond the mid.

Bonita Montero

Nov 16, 2019, 5:31:56 AM
>         unsigned rightThreads = nThreads / 2,
>                  leftThreads  = nThreads - rightThreads;

With this dynamic partitioning the throughput on my random
numbers rises by 67%. Unbelievable for such a simple change.

unsigned rightThreads = (unsigned)(nThreads * (right - p + 1.0)
/ (right - left + 1)),

Wisdom90

Nov 16, 2019, 10:38:17 AM
Look at this, I think it will help you to speed up your parallel quicksort:

Parallel partition phase for quick sort

https://blogs.msdn.microsoft.com/dhuba/2012/03/04/parallel-partition-phase-for-quick-sort/



Thank you,
Amine Moulay Ramdane.

Bonita Montero

Nov 16, 2019, 11:02:47 AM
> Look at this , i think it will help you to speed up Parallel Quicksort:
> Parallel partition phase for quick sort
> https://blogs.msdn.microsoft.com/dhuba/2012/03/04/parallel-partition-phase-for-quick-sort/

Do it yourself.
And learn a language others also use.
Almost no one uses Pascal.
Pascal died in the '90s.
So almost no one will use your code.

Wisdom90

Nov 16, 2019, 11:08:18 AM
Hello,


I am not stupid for using the Delphi and Freepascal compilers,
and of course I am also using modern Object Pascal there (and the
Delphi mode of Freepascal). Here is proof that Delphi is a serious
compiler:

NASA is also using Delphi, read about it here:

https://community.embarcadero.com/blogs/entry/want-moreexploration-40857


The European Space Agency is also using Delphi, read about it here:

https://community.embarcadero.com/index.php/blogs/entry/delphi-s-involvement-with-the-esa-rosetta-comet-spacecraft-project-1


Read more here:

https://glooscapsoftware.blogspot.com/2017/05/software-made-with-delphi-how-do-you.html

Wisdom90

Nov 16, 2019, 11:24:23 AM
On 11/16/2019 11:02 AM, Bonita Montero wrote:
Hello,


My Delphi projects work with C++Builder..


Here is C++Builder: https://www.embarcadero.com/products/cbuilder


Here is how to use my Delphi projects with C++Builder:

Mixing Delphi and C++

https://community.idera.com/developer-tools/b/blog/posts/mixing-delphi-and-c

Bonita Montero

Nov 16, 2019, 11:27:37 AM
> Hello,
> I am not stupid to use Delphi and Freepascal compilers,
> and of course i am also using there modern Object Pascal (and in the
> Delphi mode of Freepascal), here is the proof that Delphi is a serious
> compiler:
> NASA is also using Delphi, read about it here:
> https://community.embarcadero.com/blogs/entry/want-moreexploration-40857

Of course there are a lot of projects with good results for the
end-user that are developed in Delphi. But Delphi is simply a retarded
language.
And even if you consider how powerful a language is:
if you want others to use your code, use the language they prefer.
Wisdom90

Nov 16, 2019, 11:31:01 AM
I have made some of my Delphi projects work with C++, here is one of them:


My Parallel C++ Conjugate Gradient Linear System Solver Library that
scales very well version 1.76 is here..

Author: Amine Moulay Ramdane

Description:

This library contains a Parallel implementation of Conjugate Gradient
Dense Linear System Solver library that is NUMA-aware and cache-aware
that scales very well, and it contains also a Parallel implementation of
Conjugate Gradient Sparse Linear System Solver library that is
cache-aware that scales very well.

Sparse linear system solvers are ubiquitous in high performance
computing (HPC) and are often the most computationally intensive parts of
scientific computing codes. A few of the many applications relying on
sparse linear solvers include fusion energy simulation, space weather
simulation, climate modeling, and environmental modeling, and finite
element method, and large-scale reservoir simulations to enhance oil
recovery by the oil and gas industry.

Conjugate Gradient is known to converge to the exact solution in n steps
for a matrix of size n, and was historically first seen as a direct
method because of this. However, after a while people figured out that
it works really well if you just stop the iteration much earlier - often
you will get a very good approximation after much fewer than n steps. In
fact, we can analyze how fast Conjugate gradient converges. The end
result is that Conjugate gradient is used as an iterative method for
large linear systems today.

Please download the zip file and read the readme file inside the zip to
know how to use it.


You can download it from:

https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library


Language: GNU C++ and Visual C++ and C++Builder

Operating Systems: Windows, Linux, Unix and Mac OS X on (x86)

Bonita Montero

Nov 16, 2019, 11:32:08 AM
> I have made some of my Delphi projects work with C++, here is one of them:

No one has the weird idea to mix Delphi and C++. And often there is code
with generic classes which can't be mixed. Accept it or not: Delphi is
not and never will be relevant in the world of software development. By
the time Delphi hit the market, the train for Pascal derivatives had
already left.

Wisdom90

Nov 16, 2019, 11:36:45 AM
This project of mine works with GNU C++, Visual C++ and C++Builder,
and with both Windows and Linux.

Bonita Montero

Nov 16, 2019, 11:40:51 AM
>> No one has the weird idea to mix Delphi and C++. And often there is code
>> with generic classes which can't be mixed. Accept it or not: Delphi is
>> and will never be relevant in the world of sw-development. At the time
>> when Delphi hit the market, the train for Pascal-derivatives had already
>> come to an end.

> This project of mine works with: GNU C++ and Visual C++ and C++Builder
> and with both Windows and Linux.

Boy, you're manic. Seek advice from a doctor.
If you have a desire to be needed by other developers and to get their
esteem, orient yourself according to their needs. Providing Delphi
libraries today will be as successful as selling ice in the Antarctic.

Wisdom90

Nov 16, 2019, 11:45:17 AM
Hello,


You have to understand my spirit..

I want to make Delphi much more powerful by inventing scalable
algorithms and implementing them in Delphi..

Here is one of my scalable algorithms and its implementation; read about
it here:

https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

Bonita Montero

Nov 16, 2019, 11:55:34 AM
> I want to make Delphi much more powerful by inventing scalable
> algorithms and implementing them in Delphi..

That's like making a steam locomotive more powerful _today_.

Wisdom90

Nov 16, 2019, 12:02:43 PM
I understand you, but I think you are not realistic..

Because I think that Delphi is still powerful, since I have worked
with it. What I also like in Delphi is that it is a strictly typed
language; C++ inherited from C and is not as strictly typed as Delphi.
Delphi is also good at readability, and more.


Read the rest:


Look at my "sportive" spirit..

About my scalable algorithm inventions..

I am a white arab, and I am a gentleman type of person,
and I think that you also know me by the poetry that I wrote
in front of you and posted here, but I am
also a more serious computer developer, and I am also
an inventor who has invented many scalable algorithms; read about
them in my writing below:


Here is my latest scalable algorithm invention; read
what I have just responded in comp.programming.threads:

About my LRU scalable algorithm..

On 10/16/2019 7:48 AM, Bonita Montero on comp.programming.threads wrote:
> Amine, a quest for you:
> Database-servers and operating-system-kernels mostly use LRU as
> the scheme to evict old buffers from their cache. One issue with
> LRU is, that an LRU-structure can't be updated by multiple threads
> simultaneously. So you have to have a global lock.
> I managed to write a LRU-caching-class that can update the links
> in the LRU-list to put the most recent fetched block to the head
> of the list without any lock in almost any acccess. Only when
> flushing an entry or inserting a new I have to lock the structure
> completely; but in contrast to cache-hits this has usually a mag-
> nitude lower frequency because of the slowness of disk-/ssd-access,
> so this doesn't relly hurt.
> The algorithm is partitiylly locked, partitially lock-free. Even
> the case when putting cache hits to the head has to be processed
> in locked mode in very rare cases. And as I said inserting and
> flushing is conventional locked access.
> So the quest is for you: Can you guess what I did?


And here is what I have just responded:


I think I am also smart, so I have quickly found a solution that is
scalable and that is not your solution; it needs my hashtable that is
scalable and my fully scalable FIFO queue that I have invented.
And I think I will not patent it. But my solution is not lock-free: it
uses locks in a lock-striping manner and it is scalable.


And read about my other scalable algorithm inventions in my writing below:


About the buffer overflow problem..

I wrote yesterday about buffer overflows in Delphi and Freepascal..

I think there is a "higher" abstraction in Delphi and Freepascal
that does the job of avoiding buffer overflows very well: the
TMemoryStream class. It behaves like a pointer
and it supports reallocmem() and freemem() on the pointer, but
with a higher-level abstraction. Look for example at my
following example in Delphi and Freepascal; you will notice,
contrary to raw pointers, that the memory stream adapts with
writebuffer() without the need to reserve the memory, and this is why
it avoids the buffer overflow problem. Read the following example to
see how I am using it with a PAnsiChar type:

========================================


Program test;

uses system.classes, system.sysutils;

var
P: PAnsiChar;
mem: TMemoryStream;

Begin

P := 'Amine';

mem := TMemoryStream.create;

mem.position := 0;

mem.writebuffer(pointer(P)^, 6);

mem.position := 0;

writeln(PAnsiChar(mem.memory));

mem.free;

end.


===================================


So since Delphi and Freepascal also detect buffer overflows on
dynamic arrays, I think that Delphi and Freepascal are powerful
tools.


Read my previous thoughts below to understand more:


And I have just read the following webpage about "Fearless Security:
Memory safety":

https://hacks.mozilla.org/2019/01/fearless-security-memory-safety/

Here are the memory safety problems:

1- Misusing Free (use-after-free, double free)

I have solved this in Delphi and Freepascal by inventing a "Scalable"
reference counting with efficient support for weak references. Read
below about it.


2- Uninitialized variables

This can be detected by the compilers of Delphi and Freepascal.


3- Dereferencing Null pointers

I have solved this in Delphi and Freepascal by inventing a "Scalable"
reference counting with efficient support for weak references. Read
below about it.

4- Buffer overflow and underflow

This has been solved in Delphi by using madExcept, read here about it:

http://help.madshi.net/DebugMm.htm

You can buy it from here:

http://www.madshi.net/


And about race conditions and deadlocks problems and more, read my
following thoughts to understand:


I will reformulate more clearly how race-condition detection works in
Rust, so read it carefully:

You can think of the borrow checker of Rust as a validator for a locking
system: immutable references are shared read locks and mutable
references are exclusive write locks. Under this mental model, accessing
data via two independent write locks is not a safe thing to do, and
modifying data via a write lock while there are readers alive is not
safe either.

So as you can see, "mutable" references in Rust follow the
read-write-lock pattern, and this is not good, because it rules out
more fine-grained parallelism that would permit us to run the writes in
"parallel" and gain more performance from parallelizing the writes.


Read more about Rust and Delphi and my inventions..

I think the spirit of Rust is like the spirit of Ada: both are
especially designed for very high standards of safety. "But" I don't
think we have to fear the race conditions that Rust solves, because
race conditions are not so difficult to avoid when you are a decent,
knowledgeable parallel programmer, so you have to understand what I
mean. Now we have to talk about the rest of the safety guarantees of
Rust. There remains the problem of deadlocks, and I think that Rust
does not solve this problem, but I have provided you with an enhanced
DelphiConcurrent library for Delphi and Freepascal that detects
deadlocks. And there are also the memory safety guarantees of Rust;
here they are:

1- No Null Pointer Dereferences
2- No Dangling Pointers
3- No Buffer Overruns

But notice that I have solved number 1 and number 2 by inventing my
scalable reference counting with efficient support for weak references
for Delphi and Freepascal (read below to see it), and for number 3
read my following thoughts to understand:

More about research and software development..

I have just looked at the following new video:

Why is coding so hard...

https://www.youtube.com/watch?v=TAAXwrgd1U8


I understand this video, but I have to explain my work:

I am not like the techlead in the video above, because I am also an
"inventor" who has invented many scalable algorithms and their
implementations; I am also inventing effective abstractions. I'll give
you an example:

Read the following of the senior research scientist that is called Dave
Dice:

Preemption tolerant MCS locks

https://blogs.oracle.com/dave/preemption-tolerant-mcs-locks

As you can see, he is trying to invent a new lock that is preemption
tolerant, but his lock lacks some important characteristics. This is why
I have just invented a new Fast Mutex that is adaptive and that is
much, much better; I think mine is the "best", and I think you will not
find it anywhere. My new Fast Mutex has the following characteristics:

1- Starvation-free
2- Good fairness
3- It keeps efficiently and very low the cache coherence traffic
4- Very good fast path performance (it has the same performance as the
scalable MCS lock when there is contention.)
5- And it has a decent preemption tolerance.


This is how i am an "inventor", and i have also invented other
scalable algorithms such as a scalable reference counting with
efficient support for weak references, a fully scalable Threadpool,
and a fully scalable FIFO queue, and i have also invented other
scalable algorithms and their implementations, and i think i will sell
some of them to Microsoft or Google or Embarcadero or such software
companies.


Read my following writing to know me more:

More about computing and parallel computing..

The important guaranties of Memory Safety in Rust are:

1- No Null Pointer Dereferences
2- No Dangling Pointers
3- No Buffer Overruns

I think i have solved Null Pointer Dereferences and also solved Dangling
Pointers and also solved memory leaks for Delphi and Freepascal by
inventing my "scalable" reference counting with efficient support for
weak references and i have implemented it in Delphi and Freepascal (Read
about it below), and reference counting in Rust and C++ is "not" scalable.

About the (3) above that is Buffer Overruns, read here about Delphi and
Freepascal:

What's a buffer overflow and how to avoid it in Delphi?

read my above thoughts about it.


About Deadlock and Race conditions in Delphi and Freepascal:

I have ported DelphiConcurrent to Freepascal, and i have
also extended them with the support of my scalable RWLocks for Windows
and Linux and with the support of my scalable lock called MLock for
Windows and Linux and i have also added the support for a Mutex for
Windows and Linux, please look inside the DelphiConcurrent.pas and
FreepascalConcurrent.pas files inside the zip file to understand more.

You can download DelphiConcurrent and FreepascalConcurrent for Delphi
and Freepascal from:

https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent

DelphiConcurrent and FreepascalConcurrent by Moualek Adlene is a new way
to build Delphi applications which involve parallel executed code based
on threads like application servers. DelphiConcurrent provides to the
programmers the internal mechanisms to write safer multi-thread code
while taking a special care of performance and genericity.

In concurrent applications a DEADLOCK may occur when two or more
threads try to lock two or more consecutive shared resources but in a
different order. With DelphiConcurrent and FreepascalConcurrent, a
DEADLOCK is detected and automatically skipped - before it occurs -
and the programmer gets an explicit exception describing the
multi-thread problem instead of a blocking DEADLOCK which freezes the
application with no output log (and perhaps also the linked client
sessions if we talk about an application server).

Amine Moulay Ramdane has extended them with the support of his scalable
RWLocks for Windows and Linux and with the support of his scalable lock
called MLock for Windows and Linux and he has also added the support for
a Mutex for Windows and Linux, please look inside the
DelphiConcurrent.pas and FreepascalConcurrent.pas files to
understand more.

And please read the html file inside to learn more how to use it.


About race conditions now:

My scalable Adder is here..

As you have noticed i have just posted previously my modified versions
of DelphiConcurrent and FreepascalConcurrent to deal with deadlocks in
parallel programs.

But i have just read the following about how to avoid race conditions in
Parallel programming in most cases..

Here it is:

https://vitaliburkov.wordpress.com/2011/10/28/parallel-programming-with-delphi-part-ii-resolving-race-conditions/

This is why i have invented my following powerful scalable Adder to help
you do the same as the above, please take a look at its source code to
understand more, here it is:

https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal

Other than that, about composability of lock-based systems now:

Design your systems to be composable. Among the more galling claims of
the detractors of lock-based systems is the notion that they are somehow
uncomposable:

“Locks and condition variables do not support modular programming,”
reads one typically brazen claim, “building large programs by gluing
together smaller programs[:] locks make this impossible.”9 The claim, of
course, is incorrect. For evidence one need only point at the
composition of lock-based systems such as databases and operating
systems into larger systems that remain entirely unaware of lower-level
locking.

There are two ways to make lock-based systems completely composable, and
each has its own place. First (and most obviously), one can make locking
entirely internal to the subsystem. For example, in concurrent operating
systems, control never returns to user level with in-kernel locks held;
the locks used to implement the system itself are entirely behind the
system call interface that constitutes the interface to the system. More
generally, this model can work whenever a crisp interface exists between
software components: as long as control flow is never returned to the
caller with locks held, the subsystem will remain composable.

Second (and perhaps counterintuitively), one can achieve concurrency and
composability by having no locks whatsoever. In this case, there must be
no global subsystem state—subsystem state must be captured in
per-instance state, and it must be up to consumers of the subsystem to
assure that they do not access their instance in parallel. By leaving
locking up to the client of the subsystem, the subsystem itself can be
used concurrently by different subsystems and in different contexts. A
concrete example of this is the AVL tree implementation used extensively
in the Solaris kernel. As with any balanced binary tree, the
implementation is sufficiently complex to merit componentization, but by
not having any global state, the implementation may be used concurrently
by disjoint subsystems—the only constraint is that manipulation of a
single AVL tree instance must be serialized.

Read more here:

https://queue.acm.org/detail.cfm?id=1454462

And about Message Passing Process Communication Model and Shared Memory
Process Communication Model:

An advantage of the shared memory model is that memory communication
is faster compared to the message passing model on the same machine.

Read the following to notice it:

Why did Windows NT move away from the microkernel?

"The main reason that Windows NT became a hybrid kernel is speed. A
microkernel-based system puts only the bare minimum system components in
the kernel and runs the rest of them as user mode processes, known as
servers. A form of inter-process communication (IPC), usually message
passing, is used for communication between servers and the kernel.

Microkernel-based systems are more stable than others; if a server
crashes, it can be restarted without affecting the entire system, which
couldn't be done if every system component was part of the kernel.
However, because of the overhead incurred by IPC and context-switching,
microkernels are slower than traditional kernels. Due to the performance
costs of a microkernel, Microsoft decided to keep the structure of a
microkernel, but run the system components in kernel space. Starting in
Windows Vista, some drivers are also run in user mode."


More about message passing..

An advantage of the shared memory model is that memory communication
is faster compared to the message passing model on the same machine.

Read the following to notice it:

"One problem that plagues microkernel implementations is relatively poor
performance. The message-passing layer that connects
different operating system components introduces an extra layer of
machine instructions. The machine instruction overhead introduced
by the message-passing subsystem manifests itself as additional
execution time. In a monolithic system, if a kernel component needs
to talk to another component, it can make direct function calls
instead of going through a third party."

However, shared memory model may create problems such as synchronization
and memory protection that need to be addressed.

Message passing's major flaw is the inversion of control: it is the
moral equivalent of goto in unstructured programming (it's about time
somebody said that message passing is considered harmful).

Also some research shows that the total effort to write an MPI
application is significantly higher than that required to write a
shared-memory version of it.

And more about my scalable reference counting with efficient support for
weak references:

My invention that is my scalable reference counting with efficient
support for weak references version 1.37 is here..

Here i am again, i have just updated my scalable reference counting with
efficient support for weak references to version 1.37, I have just added
a TAMInterfacedPersistent that is a scalable reference counted version,
and now i think i have just made it complete and powerful.

Because I have just read the following web page:

https://www.codeproject.com/Articles/1252175/Fixing-Delphis-Interface-Limitations

But i don't agree with the writing of the guy of the above web page,
because i think you have to understand the "spirit" of Delphi, here is
why:

A component is supposed to be owned and destroyed by something else,
"typically" a form (and "typically" means in English: in "most" cases,
and this is the most important thing to understand). In that scenario,
reference counting is not used.

If you pass a component as an interface reference, it would be very
unfortunate if it was destroyed when the method returns.

Therefore, reference counting in TComponent has been removed.

Also because i have just added TAMInterfacedPersistent to my invention.

To use scalable reference counting with Delphi and FreePascal, just
replace TInterfacedObject with my TAMInterfacedObject that is the
scalable reference counted version, and replace TInterfacedPersistent
with my TAMInterfacedPersistent that is the scalable reference counted
version. You will find both my TAMInterfacedObject and my
TAMInterfacedPersistent inside the AMInterfacedObject.pas file. To
know how to use weak references, please take a look at the demo that i
have included called example.dpr and look inside my zip file at the
tutorial about weak references, and to know how to use delegation,
take a look at the demo that i have included called
test_delegation.pas and at the tutorial inside my zip file that
teaches you how to use delegation.

I think my scalable reference counting with efficient support for
weak references is stable and fast, and it works on both Windows and
Linux, and it scales on multicore and NUMA systems, and you will not
find it in C++ or Rust, and i don't think you will find it anywhere,
and you have to know that this invention of mine solves the problem of
dangling pointers and the problem of memory leaks, and my scalable
reference counting is "scalable".

And please read the readme file inside the zip file that i have just
extended to make you understand more.

You can download my new scalable reference counting with efficient
support for weak references version 1.37 from:

https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

And now i will talk about data dependency and parallel loops..

For a loop to be parallelized, every iteration must be independent of
the others. One way to be sure of it is to execute the loop in the
direction of the incremented index of the loop and in the direction of
the decremented index of the loop and verify that the results are the
same. A data dependency happens if memory is modified: a loop has a
data dependency if an iteration writes a variable that is read or
written in another iteration of the loop. There is no data dependency
if only one iteration reads or writes a variable or if many iterations
read the same variable without modifying it. So these are the
"general" rules.

Now you also have to know how to construct the parallel for loop when
there is an induction variable or when there is a reduction operation;
i will give an example of each:

If we have the following (the code looks like Algol or modern Object
Pascal):


IND:=0;

For I:=1 to N
Do
Begin
  IND := IND + I;
  A[I] := B[IND];
End;


So as you are noticing, since IND is an induction variable, to
parallelize the loop you have to do the following:


For I:=1 to N
Do
Begin
  IND := (I*(I+1))/2;
  A[I] := B[IND];
End;


Now for the reduction operation example: my invention that is my
Threadpool with priorities that scales very well (read about it below)
supports a ParallelFor() that scales very well and that supports a
"grainsize", and the grainsize can be used in the ParallelFor() with a
reduction operation, and my following powerful scalable Adder is also
used in this scenario, here it is:

https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal


So here is the example with a reduction operation in modern Object Pascal:


TOTAL := 0.0;
For I := 1 to N
Do
Begin
  TOTAL := TOTAL + A[I];
End;


So with my powerful scalable Adder and with my powerful invention that
is my ParallelFor() that scales very well, you will parallelize the
above like this:

procedure test1(j:integer; ptr:pointer);
begin
  t.add(A[j]); // "t" is my scalable Adder object
end;

// Let's suppose that N is 100000
// In the following, 10000 is the grainsize

obj.ParallelFor(1, N, test1, 10000, pointer(0));

TOTAL := t.get();


And read the following to understand how to use the grainsize of my
ParallelFor() that scales well:


About my ParallelFor() that scales very well that uses my efficient
Threadpool that scales very well:

With ParallelFor() you have to:

1- Ensure Sufficient Work

Each iteration of a loop involves a certain amount of work,
so you have to ensure a sufficient amount of work per task;
read below about the "grainsize" that i have implemented.

2- In OpenMP we have that:

Static and Dynamic Scheduling

One basic characteristic of a loop schedule is whether it is static or
dynamic:

• In a static schedule, the choice of which thread performs a particular
iteration is purely a function of the iteration number and number of
threads. Each thread performs only the iterations assigned to it at the
beginning of the loop.

• In a dynamic schedule, the assignment of iterations to threads can
vary at runtime from one execution to another. Not all iterations are
assigned to threads at the start of the loop. Instead, each thread
requests more iterations after it has completed the work already
assigned to it.

But with my ParallelFor() that scales very well, since it is using my
efficient Threadpool that scales very well, it uses round-robin
scheduling and also work stealing, so i think that this is sufficient.

Read the rest:

My Threadpool engine with priorities that scales very well is really
powerful because it scales very well on multicore and NUMA systems, also
it comes with a ParallelFor() that scales very well on multicores and
NUMA systems.

You can download it from:

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well


Here is the explanation of my ParallelFor() that scales very well:

I have also implemented a ParallelFor() that scales very well, here is
the method:

procedure ParallelFor(nMin, nMax:integer; aProc:TParallelProc;
  GrainSize:integer=1; Ptr:pointer=nil;
  pmode:TParallelMode=pmBlocking;
  Priority:TPriorities=NORMAL_PRIORITY);

nMin and nMax parameters of the ParallelFor() are the minimum and
maximum integer values of the variable of the ParallelFor() loop, aProc
parameter of ParallelFor() is the procedure to call, and GrainSize
integer parameter of ParallelFor() is the following:

The grainsize sets a minimum threshold for parallelization.

A rule of thumb is that grainsize iterations should take at least
100,000 clock cycles to execute.

For example, if a single iteration takes 100 clocks, then the grainsize
needs to be at least 1000 iterations. When in doubt, do the following
experiment:

1- Set the grainsize parameter higher than necessary. The grainsize is
specified in units of loop iterations.

If you have no idea of how many clock cycles an iteration might take,
start with grainsize=100,000.

The rationale is that each iteration normally requires at least one
clock. In most cases, step 3 will guide you to a much smaller value.

2- Run your algorithm.

3- Iteratively halve the grainsize parameter and see how much the
algorithm slows down or speeds up as the value decreases.

A drawback of setting a grainsize too high is that it can reduce
parallelism. For example, if the grainsize is 1000 and the loop has 2000
iterations, the ParallelFor() method distributes the loop across only
two processors, even if more are available.

And you can pass a parameter in Ptr as a pointer to ParallelFor(), and
you can set the pmode parameter to pmBlocking so that ParallelFor() is
blocking or to pmNonBlocking so that ParallelFor() is non-blocking,
and the Priority parameter is the priority of ParallelFor(). Look
inside the test.pas example to see how to use it.

Wisdom90

Nov 16, 2019, 12:10:45 PM
On 11/16/2019 11:55 AM, Bonita Montero wrote:
> I don't like C++, because it has inherited from C, and it is still
> a lower level language, because it is not even strictly typed,
> and it favors speed this is why it is not as high level as other
> languages.


I understand you, but i think you are not realistic..

Because i think that Delphi is still powerful, since i have worked
with it, and what i also like in Delphi is that it is a strictly typed
language, but C++ has inherited from C and is not strictly typed as
Delphi is, and Delphi is also good at readability and more.

Bonita Montero

Nov 16, 2019, 12:35:34 PM
>> That's like making a steam locomotive more powerful _today_.


> I understand you, but i think you are not realistic..

That's realistic.
You want to provide code in a language others don't use.

Bonita Montero

Nov 16, 2019, 12:36:23 PM
> I don't like C++, because it has inherited from C, and it is still
> a lower level language, because it is not even strictly typed,
> and it favors speed this is why it is not as high level as other
> languages.

C++ is a low-level as well as a high-level language.

> Because i think that Delphi is still powerful, since i have worked
> with it and i think that it is still powerful, what i also like in
> Delphi is that it is strictly typed language, but i think that C++ has
> inherited from C and it is not strictly typed as Delphi, and also Delphi
> is good at readability and more.

Delphi is dead.

Bonita Montero

Nov 16, 2019, 2:21:20 PM
> I don't like C++, because it has inherited from C, and it is still
> a lower level language, because it is not even strictly typed,
> and it favors speed this is why it is not as high level as other
> languages.

Another idea: learn Rust! Rust is a very clean language with
high-level semantics that prevent a lot of errors, as well as methods
to write unsafe code that have low-level semantics. Rust will gain
importance in the future and may become a contender to C++.
The only thing I don't like about Rust is that it doesn't have
exceptions.

Wisdom90

Nov 16, 2019, 4:01:48 PM
I think here is the problem with Rust:

I will reformulate more clearly my point about race condition
detection in Rust, so read it carefully:

You can think of the borrow checker of Rust as a validator for a locking
system: immutable references are shared read locks and mutable
references are exclusive write locks. Under this mental model, accessing
data via two independent write locks is not a safe thing to do, and
modifying data via a write lock while there are readers alive is not
safe either.

So as you are noticing, the "mutable" references in Rust follow the
read-write lock pattern, and this is not good, because it does not
permit the more fine-grained parallelism that lets us run the writes
in "parallel" and gain more performance from parallelizing the writes.


