May 18, 2021, 4:44:24 PM

Hello...

Here is my newest invention of a scalable algorithm, along with my other new inventions..

I am a white Arab, and I think I am smart, since I have also invented many scalable algorithms and other algorithms..

I have just read the following research paper about the invention called counting networks, which are better than software combining trees:

Counting Networks

http://people.csail.mit.edu/shanir/publications/AHS.pdf

And I have read the following research paper:

http://people.csail.mit.edu/shanir/publications/HLS.pdf

So as you can see, they say in the conclusion:

"Software combining trees and counting networks which are the only techniques we observed to be truly scalable"

But I have just found that the counting networks algorithm is not generally scalable, and I have the logical proof of it. This is why I have just come up with a new invention that enhances the counting networks algorithm to make it generally scalable. So you have to be careful with the existing counting networks algorithm, since it is not generally scalable.
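For readers who have not met counting networks, here is a minimal sketch of the classic width-4 bitonic counting network from the paper above. It is not my enhanced algorithm. The class names Balancer and Bitonic4 are hypothetical, and the tokens are pushed through one at a time, whereas a real implementation would make each balancer toggle an atomic operation shared by many threads.

```python
class Balancer:
    """A toggle: tokens leave alternately on the top and bottom output wire."""

    def __init__(self, top, bottom):
        self.top = top
        self.bottom = bottom
        self.next_is_top = True

    def traverse(self):
        wire = self.top if self.next_is_top else self.bottom
        self.next_is_top = not self.next_is_top
        return wire


class Bitonic4:
    """Width-4 bitonic counting network with one local counter per output wire."""

    # Each layer lists the (top, bottom) wire pairs joined by one balancer.
    LAYERS = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]
    WIDTH = 4

    def __init__(self):
        self.layers = []
        for pairs in self.LAYERS:
            by_wire = {}
            for top, bottom in pairs:
                balancer = Balancer(top, bottom)
                by_wire[top] = balancer
                by_wire[bottom] = balancer
            self.layers.append(by_wire)
        # Output wire i hands out the values i, i + 4, i + 8, ...
        self.next_value = list(range(self.WIDTH))

    def get_and_increment(self, input_wire):
        """Route one token from `input_wire` to an output wire and number it."""
        wire = input_wire
        for by_wire in self.layers:
            wire = by_wire[wire].traverse()
        value = self.next_value[wire]
        self.next_value[wire] += self.WIDTH
        return value
```

Whatever input wires the tokens choose, the values handed out cover 0, 1, 2, ... without gaps or duplicates, and no single memory location is touched by every token, which is where the scalability claim of counting networks comes from.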

More philosophy about my kind of work..

I have just written the following:

--

More philosophy about my way of doing things..

You have to know me better: I have just posted about Computer Science vs Software Engineering, but I do not fit neatly into Computer Science or Software Engineering, because I am an inventor of many scalable software algorithms and other algorithms, and I have invented some powerful software tools. So my way of working is to be innovative, creative and inventive, much like a PhD researcher, and I am writing some books about my inventions and about my powerful tools.

--

I will give an example of how I am inventive and creative. I have just read the following book (and other books like it) by a PhD researcher about operational research and capacity planning:

Performance by Design: Computer Capacity Planning by Example

https://www.amazon.ca/Performance-Design-Computer-Capacity-Planning/dp/0130906735

So I have just found that the methodologies of those PhD researchers for the E-Business service don't work, because they do their calculations for a given arrival rate that is statistically and empirically measured from the behavior of customers, and I think that is not correct. So I am being inventive, and I have come up with a new methodology that fixes the arrival rate from the data by using a hyperexponential service distribution (it is mathematical), and it is also good for Denial-of-Service (DoS) attacks. I will write a powerful book that teaches my new methodology and explains the mathematics behind it, and I will sell it. My new methodology will work for cloud computing and for computer servers.

More about my inventions of scalable algorithms..

More precision about my new inventions of scalable algorithms..

Look at my powerful inventions LW_Fast_RWLockX and Fast_RWLockX below. They are two powerful scalable RWLocks that are FIFO fair, starvation-free and costless on the reader side (that means no atomics and no fences on the reader side). They use sys_membarrier expedited on Linux and FlushProcessWriteBuffers() on Windows. If you look at the source code of my LW_Fast_RWLockX.pas and Fast_RWLockX.pas inside the zip file, you will notice that on Linux they call two functions, membarrier1() and membarrier2(): membarrier1() registers the process's intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED, and membarrier2() executes a memory barrier on each running thread belonging to the same process as the calling thread.

Read more here to understand:

https://man7.org/linux/man-pages/man2/membarrier.2.html

Here are my new powerful inventions of scalable algorithms..

I have just updated my powerful inventions LW_Fast_RWLockX and Fast_RWLockX, two powerful scalable RWLocks that are FIFO fair, starvation-free and costless on the reader side (that means no atomics and no fences on the reader side). They use sys_membarrier expedited on Linux and FlushProcessWriteBuffers() on Windows, so they now work on both Linux and Windows. I think my inventions are really smart, since the following PhD researcher says:

"Until today, there is no known efficient reader-writer lock with starvation-freedom guarantees;"

Read more here:

http://concurrencyfreaks.blogspot.com/2019/04/onefile-and-tail-latency.html

So I think that my above powerful inventions of scalable reader-writer locks are efficient, FIFO fair and starvation-free.

LW_Fast_RWLockX is a lightweight scalable Reader-Writer Mutex that uses a technique that looks like Seqlock, but without looping on the reader side as Seqlock does, and this is what makes the reader side costless. It is fair, it is of course starvation-free, and it does spin-wait. Fast_RWLockX is a lightweight scalable Reader-Writer Mutex that uses the same Seqlock-like technique without looping on the reader side, so its reader side is also costless. It is fair and of course starvation-free, but it does not spin-wait: it waits on my SemaMonitor, so it is energy efficient.

You can read about them and download them from my website here:

https://sites.google.com/site/scalable68/scalable-rwlock
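For contrast, here is a minimal sketch of the classic Seqlock, whose reader-side retry loop is the part that my locks are designed to avoid. This is the textbook technique, not the algorithm of LW_Fast_RWLockX or Fast_RWLockX, and the class name SeqLock is hypothetical. It relies on CPython executing these reads and writes in program order; a native implementation would need the appropriate memory barriers around the sequence counter.

```python
import threading


class SeqLock:
    """Classic seqlock protecting a pair of values that must stay consistent."""

    def __init__(self, a=0, b=0):
        self._seq = 0                      # even: stable, odd: write in progress
        self._a, self._b = a, b
        self._writers = threading.Lock()   # serializes writers only

    def write(self, a, b):
        with self._writers:
            self._seq += 1                 # becomes odd: readers must retry
            self._a = a
            self._b = b
            self._seq += 1                 # becomes even: data is stable again

    def read(self):
        while True:                        # the reader-side retry loop
            start = self._seq
            if start % 2 == 1:
                continue                   # a write is in progress
            a, b = self._a, self._b
            if self._seq == start:         # no writer intervened
                return a, b
```

Readers never block writers here, but a reader can be forced to retry for as long as writers keep arriving, which is why a classic Seqlock reader is not starvation-free.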

About the Linux sys_membarrier() expedited and the Windows FlushProcessWriteBuffers()..

I have just read the following webpage:

https://lwn.net/Articles/636878/

And it is interesting and it says:

---

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 3129842 writes
signal-based scheme:          9825306874 reads,    5386 writes
sys_membarrier expedited:     6637539697 reads,  852129 writes
sys_membarrier non-expedited: 7992076602 reads,     220 writes

---

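Look at how powerful "sys_membarrier expedited" is: it keeps about two thirds of the read throughput of the signal-based scheme while sustaining far more writes than either the signal-based scheme or the non-expedited sys_membarrier.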
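To make the comparison concrete, the numbers quoted above can be turned into ratios with a few lines of arithmetic. This is only a sketch over the figures from the LWN article; the helper name ratios_vs is hypothetical.

```python
# Operation counts quoted above: (reads, writes) in 10 s, 6 readers, 2 writers.
RESULTS = {
    "memory barriers in reader": (1701557485, 3129842),
    "signal-based scheme": (9825306874, 5386),
    "sys_membarrier expedited": (6637539697, 852129),
    "sys_membarrier non-expedited": (7992076602, 220),
}


def ratios_vs(baseline, scheme="sys_membarrier expedited"):
    """Return (read_ratio, write_ratio) of `scheme` relative to `baseline`."""
    reads, writes = RESULTS[scheme]
    base_reads, base_writes = RESULTS[baseline]
    return reads / base_reads, writes / base_writes


if __name__ == "__main__":
    for name in RESULTS:
        read_ratio, write_ratio = ratios_vs(name)
        print(f"expedited vs {name}: reads x{read_ratio:.2f}, writes x{write_ratio:.2f}")
```

The expedited scheme does roughly 3.9 times the reads of the in-reader memory barriers, and roughly 158 times the writes of the signal-based scheme.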

Cache-coherency protocols do not use IPIs (inter-processor interrupts), and as a user-space developer you do not care about IPIs at all; one is mostly interested in the cost of cache coherency itself. However, the Win32 API provides a function, FlushProcessWriteBuffers(), that issues IPIs to all processors in the affinity mask of the current process. You can use it to investigate the cost of IPIs.

With a simple synthetic test on a dual-core machine, I obtained the following numbers:

420 cycles is the minimum cost of the FlushProcessWriteBuffers() function on the issuing core.

1600 cycles is the mean cost of the FlushProcessWriteBuffers() function on the issuing core.

1300 cycles is the mean cost of the FlushProcessWriteBuffers() function on the remote core.

Note that, as far as I understand, the function issues an IPI to the remote core, the remote core acks it with another IPI, and the issuing core waits for the ack IPI and then returns.

The IPIs also have the indirect cost of flushing the processor pipeline.

More about WaitAny() and WaitAll() and more..

Look at the following concurrency abstractions of Microsoft:

https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitany?view=netframework-4.8

https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitall?view=netframework-4.8

They look like the following WaitForAny() and WaitForAll() of Delphi, here they are:

http://docwiki.embarcadero.com/Libraries/Sydney/en/System.Threading.TTask.WaitForAny

http://docwiki.embarcadero.com/Libraries/Sydney/en/System.Threading.TTask.WaitForAll

WaitForAll() is easy, and I have implemented it in my Threadpool engine, which I invented and which scales very well. To learn how to do it, read the HTML tutorial inside its zip file, which you can download from my website here:

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well

As for WaitForAny(), you can also implement it using my SemaMonitor, and I will soon give you an example of how to do it. You can download my SemaMonitor invention from my website here:

https://sites.google.com/site/scalable68/semacondvar-semamonitor
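To show the semantics of WaitAny and WaitAll, here is a minimal sketch using Python's standard concurrent.futures module. It does not use my Threadpool engine or my SemaMonitor, and the function names task and run_tasks are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED, ALL_COMPLETED


def task(delay, result):
    """Stand-in for real work: sleep, then return a result."""
    time.sleep(delay)
    return result


def run_tasks(delays):
    """Start one task per delay; return (result of a first finisher, all results)."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(task, d, i) for i, d in enumerate(delays)]
        # WaitAny: block until at least one task has finished.
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        first = next(iter(done)).result()
        # WaitAll: block until every task has finished.
        wait(futures, return_when=ALL_COMPLETED)
        return first, [f.result() for f in futures]
```

Both calls block the caller instead of polling, so the waiting thread does not consume CPU cycles.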

Here are my other new software inventions..

I have just looked at the source code of the following multiplatform pevents library:

https://github.com/neosmart/pevents

Notice that its WaitForMultipleEvents() is implemented with pthreads, but it is not scalable on multicores. So I have just invented a WaitForMultipleObjects() that looks like the Windows WaitForMultipleObjects(), that is fully "scalable" on multicores, and that works on Windows, Linux and MacOSX. Like WaitForMultipleObjects(), it blocks while waiting for the objects, so it doesn't consume CPU cycles when waiting, and it works with events, futures and tasks.
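As an illustration of what blocking on several events means, here is a minimal sketch in which all events share one condition variable. It is neither the pevents code nor my invention, and the class name EventGroup is hypothetical. A single shared condition variable is the simple design, and it is exactly the kind of design that becomes a contention point on multicores.

```python
import threading


class EventGroup:
    """Manual-reset events that share one condition variable."""

    def __init__(self, count):
        self._cond = threading.Condition()
        self._signaled = [False] * count

    def set(self, index):
        with self._cond:
            self._signaled[index] = True
            self._cond.notify_all()        # wake every waiter to re-check

    def wait_any(self, indices, timeout=None):
        """Block until one of `indices` is set; return it, or None on timeout."""
        with self._cond:
            ready = self._cond.wait_for(
                lambda: any(self._signaled[i] for i in indices), timeout)
            if not ready:
                return None
            return next(i for i in indices if self._signaled[i])

    def wait_all(self, indices, timeout=None):
        """Block until every event in `indices` is set; return True or False."""
        with self._cond:
            return self._cond.wait_for(
                lambda: all(self._signaled[i] for i in indices), timeout)
```

The waiter sleeps inside wait_for() until a set() call wakes it, so it does not spin.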

Here are my other new software inventions..

I have just invented a latch and a thread barrier that are both fully "scalable" on multicores, and they are really powerful.

Read about the C++ latches and thread barriers, which are not scalable on multicores, here:

https://www.modernescpp.com/index.php/latches-and-barriers
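For readers who want the semantics of the two primitives, here is a minimal sketch: a single-use latch built on a condition variable, and Python's standard reusable threading.Barrier. These are the plain, non-scalable forms, not my inventions, and the names CountDownLatch and run_phases are hypothetical.

```python
import threading


class CountDownLatch:
    """Single-use latch: wait() blocks until count_down() was called `count` times."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            if self._count > 0:
                self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait(self, timeout=None):
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)


def run_phases(workers, phases):
    """Each worker logs (phase, id) once per phase; a barrier separates phases."""
    log = []
    barrier = threading.Barrier(workers)   # reusable across phases
    done = CountDownLatch(workers)

    def worker(wid):
        for phase in range(phases):
            log.append((phase, wid))
            barrier.wait()                 # nobody starts the next phase early
        done.count_down()

    for wid in range(workers):
        threading.Thread(target=worker, args=(wid,)).start()
    done.wait()                            # the caller blocks on the latch
    return log
```

The barrier is reused in every phase, while the latch is used once, which is the essential difference between the two.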

Here are my other software inventions:

More about my scalable math Linear System Solver Library...

As you have just noticed, I have just spoken about my Linear System Solver Library (read below). Right now it scales very well, but I will soon make it "fully" scalable on multicores using one of the scalable algorithms that I have invented, and I will extend it much more to also support efficient matrix operations that are scalable on multicores, and more. Since it will come with one of my scalable algorithms, I think I will sell it too.

More about mathematics and about scalable Linear System Solver Libraries and more..

I have just noticed that a software architect from Austria called Michael Rabatscher has designed and implemented the MrMath library, which is also a parallelized library.

Here he is:

https://at.linkedin.com/in/michael-rabatscher-6821702b

And here is his MrMath Library for Delphi and Freepascal:

https://github.com/mikerabat/mrmath

But I think that he is not so smart, and I think I am smart like a genius, and I say that his MrMath library is not scalable on multicores: notice that the linear system solver of his MrMath library is not scalable on multicores, and that the threaded matrix operations of his library are not scalable on multicores either. This is why I have invented a Conjugate Gradient Linear System Solver Library for C++, Delphi and Freepascal that is scalable on multicores. Here it is; read about it in my following thoughts (I will also soon extend my library to support scalable matrix operations):

About SOR and Conjugate gradient mathematical methods..

I have just looked at SOR (the Successive Over-Relaxation method), and I think it is much less powerful than the Conjugate Gradient method. Read the following to see it:

COMPARATIVE PERFORMANCE OF THE CONJUGATE GRADIENT AND SOR METHODS FOR COMPUTATIONAL THERMAL HYDRAULICS

https://inis.iaea.org/collection/NCLCollectionStore/_Public/19/055/19055644.pdf?r=1&r=1

This is why I have implemented, in both C++ and Delphi, my Parallel Conjugate Gradient Linear System Solver Library that scales very well. Read my following thoughts about it to understand more:

About the convergence properties of the conjugate gradient method

The conjugate gradient method can theoretically be viewed as a direct method, as it produces the exact solution after a finite number of iterations, which is not larger than the size of the matrix, in the absence of round-off error. However, the conjugate gradient method is unstable with respect to even small perturbations, e.g., most directions are not in practice conjugate, and the exact solution is never obtained. Fortunately, the conjugate gradient method can be used as an iterative method as it provides monotonically improving approximations to the exact solution, which may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. The improvement is typically linear and its speed is determined by the condition number κ(A) of the system matrix A: the larger is κ(A), the slower the improvement.

Read more here:

http://pages.stat.wisc.edu/~wahba/stat860public/pdf1/cj.pdf
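The dependence on the condition number mentioned above can be made precise with the standard textbook error bound for the conjugate gradient method, assuming A is symmetric positive-definite. Here x_k is the k-th iterate, x_* the exact solution, and the norm is the A-norm (these symbols are my notation, not taken from the text above):

```latex
\|x_k - x_*\|_A \;\le\; 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^{k}\,\|x_0 - x_*\|_A ,
\qquad \|v\|_A = \sqrt{v^{\mathsf T} A\, v}
```

The factor in parentheses approaches 1 as κ(A) grows, which is why a larger κ(A) means slower improvement.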

So I think my Conjugate Gradient Linear System Solver Library, which scales very well, is still very useful. Read about it in my writing below:

Read the following interesting news:

The finite element method finds its place in games

Read more here:

https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Fhpc.developpez.com%2Factu%2F288260%2FLa-methode-des-elements-finis-trouve-sa-place-dans-les-jeux-AMD-propose-la-bibliotheque-FEMFX-pour-une-simulation-en-temps-reel-des-deformations%2F

But you have to be aware that large finite element problems are commonly solved with the Conjugate Gradient method. Read here to see it:

Conjugate Gradient Method for Solution of Large Finite Element Problems on CPU and GPU

https://pdfs.semanticscholar.org/1f4c/f080ee622aa02623b35eda947fbc169b199d.pdf

This is why I have also designed and implemented my Parallel Conjugate Gradient Linear System Solver library that scales very well. Here it is:

My Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well, version 1.76, is here..

Author: Amine Moulay Ramdane

Description:

This library contains a parallel implementation of a Conjugate Gradient dense linear system solver that is NUMA-aware and cache-aware and that scales very well, and it also contains a parallel implementation of a Conjugate Gradient sparse linear system solver that is cache-aware and that scales very well.

Sparse linear system solvers are ubiquitous in high performance computing (HPC) and are often the most computationally intensive parts of scientific computing codes. A few of the many applications relying on sparse linear solvers include fusion energy simulation, space weather simulation, climate modeling, environmental modeling, the finite element method, and large-scale reservoir simulations to enhance oil recovery by the oil and gas industry.

In exact arithmetic, Conjugate Gradient is known to converge to the exact solution in at most n steps for a symmetric positive-definite matrix of size n, and it was historically first seen as a direct method because of this. However, after a while people figured out that it works really well if you just stop the iteration much earlier: often you get a very good approximation after far fewer than n steps. In fact, we can analyze how fast Conjugate Gradient converges. The end result is that Conjugate Gradient is used today as an iterative method for large linear systems.

Please download the zip file and read the readme file inside the zip to know how to use it.

You can download it from:

https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

Language: GNU C++ and Visual C++ and C++Builder

Operating Systems: Windows, Linux, Unix and Mac OS X (on x86)
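To show what the method computes, here is a minimal sequential sketch of the textbook conjugate gradient iteration for a small dense symmetric positive-definite system. It is not the code or the API of my library, which is parallel, NUMA-aware and cache-aware; the function names dot, matvec and conjugate_gradient are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite A given as a list of rows."""
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n          # headroom beyond n steps for round-off error
    x = [0.0] * n
    r = list(b)                    # residual b - A x for the start x = 0
    p = list(r)                    # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 <= tol:   # stop once the residual norm is small enough
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

For example, conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]) approximates the exact solution (1/11, 7/11). The matrix-vector product inside the loop dominates the cost, so it is the natural part to parallelize.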

--

Thread Barrier for Delphi and Freepascal version 1.0 is here..

I have added my condition variable implementation and my scalable lock called MLock, both of which work on Windows and Linux, and I have made the Thread Barrier work on both Windows and Linux. You can now pass a parameter to the constructor of the Thread Barrier: ctMutex to use a Mutex, ctMLock to use the scalable lock called MLock, or ctCriticalSection to use a Critical Section.

You can download it from my website here:

https://sites.google.com/site/scalable68/thread-barrier-for-delphi-and-freepascal

Yet more precision about my inventions SemaMonitor, SemaCondvar and Monitor..

My inventions SemaMonitor and SemaCondvar have a fast path: when the count of my SemaMonitor or my SemaCondvar is greater than 0, the wait() method stays in user mode and doesn't switch from user mode to kernel mode, a switch that costs around 1500 CPU cycles and is expensive. The signal() method also has a fast path when there is no item in the queue and the count is less than MaximumCount. Read here about the cost (in CPU cycles) of switching between Windows user mode and kernel mode:

https://stackoverflow.com/questions/1368061/whats-the-cost-in-cycles-to-switch-between-windows-kernel-and-user-mode#:~:text=1%20Answer&text=Switching%20from%20%E2%80%9Cuser%20mode%E2%80%9D%20to,rest%20is%20%22kernel%20overhead%22.

You can read about and download my inventions of SemaMonitor and SemaCondvar from here:

https://sites.google.com/site/scalable68/semacondvar-semamonitor

And the light weight version is here:

https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor

I have also implemented an efficient Monitor over my SemaCondvar.

Here is the description of my efficient Monitor, from the Monitor.pas file that you will find inside the zip file:

Description:

This is my implementation of a Monitor over my SemaCondvar.

You will find the Monitor class inside the Monitor.pas file inside the zip file.

When you set the first parameter of the constructor to true, a signal will not be lost if no thread is waiting in the wait() method; when you set it to false, a signal sent while no thread is waiting in wait() will be lost.

The second parameter of the constructor is the kind of lock: set it to ctMLock to use my scalable node-based lock called MLock, to ctMutex to use a Mutex, or to ctCriticalSection to use the TCriticalSection.

Here are the methods of my efficient Monitor that I have implemented:

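TMonitor = class
private
  cache0: typecache0;
  lock1: TSyncLock;
  obj: TSemaCondvar;
  cache1: typecache0;
public
  // bool: when true, a signal sent while no thread is waiting is not lost
  // lock: ctMLock, ctMutex or ctCriticalSection
  constructor Create(bool: boolean = true; lock: TMyLocks = ctMLock);
  destructor Destroy; override;
  procedure Enter();   // enter the monitor's lock
  procedure Leave();   // leave the monitor's lock
  function Signal(): boolean; overload;   // signal one waiting thread
  function Signal(nbr: long; var remains: long): boolean; overload;   // signal nbr waiting threads
  procedure Signal_All();   // signal all the waiting threads
  function Wait(const AMilliseconds: longword = INFINITE): boolean;
  function WaitersBlocked(): long;   // number of waiting threads
end;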

The wait() method is for threads to wait on the Monitor object until it is signaled. If wait() fails, it may be because the number of waiters is greater than high(longword).

The signal() method signals one waiting thread on the Monitor object; if signal() fails, the returned value is false.

The signal_all() method signals all the threads waiting on the Monitor object.

The signal(nbr:long;var remains:long) method signals nbr waiting threads; if it fails, the number of signals that were not delivered is returned in the remains variable.

WaitersBlocked() returns the number of threads waiting on the Monitor object.

The Enter() and Leave() methods enter and leave the monitor's lock.
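To illustrate the difference made by the first constructor parameter, here is a minimal sketch of a monitor whose signals are either remembered or lost when nobody is waiting. It is built on Python's threading.Condition and is not the implementation in Monitor.pas; the class name SignalMonitor and the parameter remember_signals are hypothetical.

```python
import threading


class SignalMonitor:
    """Condition-variable wrapper whose signals are either remembered or lost."""

    def __init__(self, remember_signals=True):
        self._remember = remember_signals
        self._cond = threading.Condition()
        self._pending = 0        # signals not yet consumed by a waiter
        self._waiters = 0        # threads currently inside wait()

    def signal(self):
        """Signal one waiter; return False when the signal is lost."""
        with self._cond:
            if not self._remember and self._pending >= self._waiters:
                return False     # no unserved waiter: the signal is lost
            self._pending += 1
            self._cond.notify()
            return True

    def wait(self, timeout=None):
        """Block until a signal is available; return False on timeout."""
        with self._cond:
            self._waiters += 1
            try:
                ok = self._cond.wait_for(lambda: self._pending > 0, timeout)
                if ok:
                    self._pending -= 1
                return ok
            finally:
                self._waiters -= 1

    def waiters_blocked(self):
        with self._cond:
            return self._waiters
```

With remember_signals=True, a signal() issued before any wait() lets the next wait() return at once; with remember_signals=False, the same wait() times out.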

You can download the zip files from:

https://sites.google.com/site/scalable68/semacondvar-semamonitor

and the lightweight version is here:

https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor

More about my powerful inventions of scalable reference counting algorithm and of my scalable algorithms..

I invite you to read the following web page:

Why is memory reclamation so important?

https://concurrencyfreaks.blogspot.com/search?q=resilience+and+urcu

Notice that it is saying the following about RCU:

"Reason number 4, resilience

Another reason to go with lock-free/wait-free data structures is because they are resilient to failures. On a shared memory system with multiples processes accessing the same data structure, even if one of the processes dies, the others will be able to progress in their work. This is the true gem of lock-free data structures: progress in the presence of failure. Blocking data structures (typically) do not have this property (there are exceptions though). If we add a blocking memory reclamation (like URCU) to a lock-free/wait-free data structure, we are loosing this resilience because one dead process will prevent further memory reclamation and eventually bring down the whole system.

There goes the resilience advantage out the window."

So I think that RCU cannot be used for reference counting, since it is blocking on the writer side; being not lock-free on the writer side, it is not resilient to failures.

This is why I have invented my powerful scalable reference counting with efficient support for weak references, which is lock-free for its scalable reference counting. Here it is:

https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

My scalable reference counting algorithm is of the SCU(0,1) class of algorithms, so under scheduling conditions that approximate those found in commercial hardware architectures, it becomes wait-free with a system latency of O(sqrt(k)) and an individual latency of O(k*sqrt(k)), where k is the number of threads.
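These two figures follow from the paper's general bound for the class SCU(q, s) by substituting q = 0 and s = 1, with k the number of threads:

```latex
\text{system latency: } O\!\left(q + s\sqrt{k}\right)\Big|_{q=0,\;s=1} = O\!\left(\sqrt{k}\right),
\qquad
\text{individual latency: } O\!\left(k\,(q + s\sqrt{k})\right)\Big|_{q=0,\;s=1} = O\!\left(k\sqrt{k}\right)
```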

The proof is in the following research paper:

https://arxiv.org/pdf/1311.3200.pdf

This paper suggests a simple solution to this problem. We show that, for a large class of lock-free algorithms, under scheduling conditions which approximate those found in commercial hardware architectures, lock-free algorithms behave as if they are wait-free. In other words, programmers can keep on designing simple lock-free algorithms instead of complex wait-free ones, and in practice, they will get wait-free progress.

In its analysis of the class SCU(q, s), it says:

"Given an algorithm in SCU(q, s) on k correct processes under a uniform stochastic scheduler, the system latency is O(q + s*sqrt(k)), and the individual latency is O(k(q + s*sqrt(k)))."

My other inventions also include the following scalable RWLocks that are FIFO fair and starvation-free:

Here is my invention of a scalable, starvation-free, FIFO fair and lightweight Multiple-Readers-Exclusive-Writer Lock called LW_RWLockX, which works across processes and threads:

https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

And here are my inventions of new variants of scalable RWLocks that are FIFO fair and starvation-free:

https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

More about the energy efficiency of Transactional memory and more..

I have just read the following PhD paper, which is also about the energy efficiency of Transactional memory:

Techniques for Enhancing the Efficiency of Transactional Memory Systems

http://kth.diva-portal.org/smash/get/diva2:1258335/FULLTEXT02.pdf

I think it is the best known energy-efficient algorithm for Transactional memory, but I think it is not good, since for 64 cores the Beta parameter can be 16 cores. So I think I am smart, and I have just invented a much more energy-efficient and powerful scalable fast Mutex, and I have also just invented scalable RWLocks that are starvation-free and fair. Read about them in my writing and thoughts below:

More about deadlocks and lock-based systems and more..

I have just read the following from a software engineer from Quebec, Canada:

A deadlock-detecting mutex

https://faouellet.github.io/ddmutex/

I quickly understood his algorithm, but I think it is not efficient at all: we can find out whether a directed graph contains a cycle (a nontrivial strongly connected component) in O(V+E) time, so the algorithm of the engineer from Quebec, Canada takes around O(n*(V+E)) time, which is not good.
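To make the O(V+E) figure concrete, here is a minimal sketch of cycle detection in a wait-for graph with an iterative depth-first search. It is neither the algorithm of the blog post above nor the one used in DelphiConcurrent; the function name find_cycle and the graph encoding are hypothetical.

```python
def find_cycle(wait_for):
    """Return one cycle as a list of nodes, or None, in O(V + E) time.

    `wait_for` maps each thread to the threads it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2           # unvisited, on the DFS path, finished
    color = {}
    for root in wait_for:
        if color.get(root, WHITE) != WHITE:
            continue
        color[root] = GREY
        path = [root]
        stack = [iter(wait_for.get(root, ()))]
        while stack:
            for nxt in stack[-1]:
                state = color.get(nxt, WHITE)
                if state == GREY:           # back edge: a cycle, so a deadlock
                    return path[path.index(nxt):]
                if state == WHITE:
                    color[nxt] = GREY
                    path.append(nxt)
                    stack.append(iter(wait_for.get(nxt, ())))
                    break
            else:                           # every successor has been explored
                color[path.pop()] = BLACK
                stack.pop()
    return None
```

Each node is colored at most twice and each edge is examined once, which gives the O(V+E) bound for one full check of the graph.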

So a much better way is to use my following way of detecting deadlocks:

DelphiConcurrent and FreepascalConcurrent are here

Read more here on my website:

https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent

And I will soon enhance DelphiConcurrent and FreepascalConcurrent much more, to support both communication deadlocks and resource deadlocks.

About Transactional memory and locks..

I have just read the following paper about Transactional memory and locks:

http://sunnydhillon.net/assets/docs/concurrency-tm.pdf

I don't agree with the above paper; read my following thoughts to understand why:

I have just invented a new powerful scalable fast mutex, and it has the following characteristics:

1- Starvation-free

2- Tunable fairness

3- It efficiently keeps its cache-coherence traffic very low

4- Very good fast path performance

5- Good preemption tolerance

6- It is faster than the scalable MCS lock

7- It solves the problem of lock convoying

So my new invention also solves the following problem:

The convoy phenomenon

https://blog.acolyer.org/2019/07/01/the-convoy-phenomenon/

And here is my other new invention, a scalable RWLock that works across processes and threads and that is starvation-free and fair. I will soon enhance it much more, and it will become really powerful:

https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

And about Lock-free versus Lock, read my following post:

https://groups.google.com/forum/#!topic/comp.programming.threads/F_cF4ft1Qic

And about deadlocks, here is also how I have solved them, and I will soon enhance DelphiConcurrent and FreepascalConcurrent much more:

DelphiConcurrent and FreepascalConcurrent are here

Read more here on my website:

https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent

So I think that with my above scalable fast mutex and my scalable RWLocks that are starvation-free and fair, and by reading the following about the composability of lock-based systems, you will notice that lock-based systems are still useful.

"About composability of lock-based systems..

Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable: “Locks and condition variables do not support modular programming,” reads one typically brazen claim, “building large programs by gluing together smaller programs[:] locks make this impossible.” The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.

There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.

Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be no global subsystem state - subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems - the only constraint is that manipulation of a single AVL tree instance must be serialized."

Read more here:

https://queue.acm.org/detail.cfm?id=1454462
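The first way described in the quote, locking that is entirely internal to the subsystem, can be sketched as follows. The class Inventory and the function transfer are hypothetical names for illustration only.

```python
import threading


class Inventory:
    """All locking is internal: no method returns while holding the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stock = {}

    def add(self, item, quantity):
        with self._lock:
            self._stock[item] = self._stock.get(item, 0) + quantity

    def remove(self, item, quantity):
        """Remove `quantity` units if available; return whether it succeeded."""
        with self._lock:
            if self._stock.get(item, 0) < quantity:
                return False
            self._stock[item] -= quantity
            return True


def transfer(source, target, item, quantity):
    """Compose two subsystems without ever seeing or nesting their locks."""
    if source.remove(item, quantity):
        target.add(item, quantity)
        return True
    return False
```

Because transfer() never holds two locks at once, it cannot deadlock against another transfer in the opposite direction. The price is that the pair of operations is not atomic as a whole: another thread can observe the moment after the removal and before the addition.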

About mathematics and about abstraction..

I think my specialization is also that I have invented many software algorithms and scalable software algorithms, and I am still inventing others. Those algorithms are like mathematical theorems that you prove and then present at a higher level of abstraction. More than that, my algorithms are presented in the form of higher-level software abstractions that hide their complexity, and this is the part that interests me most. For example, notice how I construct a higher-level abstraction in my following tutorial, as a methodology that first models the synchronization objects of parallel programs with logic primitives (If-Then-OR-AND), so that they are easy to translate to Petri nets, which are then used to detect deadlocks in parallel programs. Please take a look at it at the following web link, because this tutorial of mine teaches by way of higher-level abstraction:

How to analyse parallel applications with Petri Nets

https://sites.google.com/site/scalable68/how-to-analyse-parallel-applications-with-petri-nets

So notice that my methodology is a generalization that covers both communication deadlocks and resource deadlocks in parallel programs:

1- Communication deadlocks that result from incorrect use of

event objects or condition variables (i.e. wait-notify

synchronization).

2- Resource deadlocks, a common kind of deadlock in which a set of

threads blocks forever because each thread in the set is waiting to

acquire a lock held by another thread in the set.
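
The second kind of deadlock above is a cycle of threads each waiting for the next. A minimal sketch of detecting such a cycle follows. Note that this is a plain wait-for graph check, not the Petri net methodology of the tutorial, and the function name and input shape are assumptions made for illustration.

```python
def find_deadlock(wait_for):
    """Return the threads forming a cycle in the wait-for graph, or None.

    wait_for maps each thread name to the thread that holds the lock it is
    waiting for, or to None when the thread is not blocked.
    """
    for start in wait_for:
        seen = []
        node = start
        # Follow the chain of "is waiting for" until it ends or repeats.
        while node is not None and node not in seen:
            seen.append(node)
            node = wait_for.get(node)
        if node is not None:
            # The chain came back to a thread already visited: a cycle.
            return seen[seen.index(node):]
    return None
```

For example, `find_deadlock({"T1": "T2", "T2": "T3", "T3": "T1"})` reports the three threads as a resource deadlock, while a chain that ends in an unblocked thread reports nothing.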

This is what interests me in mathematics: I want to work efficiently in mathematics at a much higher level of abstraction. As an example of what I am doing, look at how I implement mathematics as parallel conjugate gradient system solvers that scale very well, and how I present them at a higher level of abstraction. Read the following to notice it:

About SOR and Conjugate gradient mathematical methods..

I have just looked at SOR (the Successive Over-Relaxation method), and I think it is much less powerful than the conjugate gradient method; read the following to notice it:

COMPARATIVE PERFORMANCE OF THE CONJUGATE GRADIENT AND SOR METHODS

FOR COMPUTATIONAL THERMAL HYDRAULICS

https://inis.iaea.org/collection/NCLCollectionStore/_Public/19/055/19055644.pdf?r=1&r=1

This is why I have implemented, in both C++ and Delphi, my Parallel Conjugate Gradient Linear System Solver Library that scales very well. Read my following thoughts about it to understand more:

About the convergence properties of the conjugate gradient method

The conjugate gradient method can theoretically be viewed as a direct method, as it produces the exact solution after a finite number of iterations, which is not larger than the size of the matrix, in the absence of round-off error. However, the conjugate gradient method is unstable with respect to even small perturbations, e.g., most directions are not in practice conjugate, and the exact solution is never obtained. Fortunately, the conjugate gradient method can be used as an iterative method as it provides monotonically improving approximations to the exact solution, which may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. The improvement is typically linear and its speed is determined by the condition number κ(A) of the system matrix A: the larger is κ(A), the slower the improvement.

Read more here:

http://pages.stat.wisc.edu/~wahba/stat860public/pdf1/cj.pdf
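
To make the iterative use of the method concrete, here is a minimal serial sketch of the textbook conjugate gradient iteration in pure Python. It is not the code of my library: it assumes a small dense symmetric positive-definite matrix given as a list of rows, and the function names and the tolerance are chosen for illustration only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite A.

    Returns the approximate solution and the number of iterations used.
    """
    n = len(b)
    max_iter = max_iter or 10 * n
    x = [0.0] * n
    r = list(b)          # residual b - A x for the starting guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for k in range(max_iter):
        if rs_old ** 0.5 <= tol:   # stop once the residual norm is small
            return x, k
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter
```

Stopping on the residual norm instead of running all n steps is what makes this an iterative method in practice: `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` approaches the solution (1/11, 7/11).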

So I think my Conjugate Gradient Linear System Solver Library, which scales very well, is still very useful; read about it in my writing below:

Read the following interesting news:

The finite element method finds its place in games

Read more here:

https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Fhpc.developpez.com%2Factu%2F288260%2FLa-methode-des-elements-finis-trouve-sa-place-dans-les-jeux-AMD-propose-la-bibliotheque-FEMFX-pour-une-simulation-en-temps-reel-des-deformations%2F

But you have to be aware that the finite element method commonly uses the conjugate gradient method to solve the linear systems it produces; read here to notice it:

Conjugate Gradient Method for Solution of Large Finite Element Problems on CPU and GPU

https://pdfs.semanticscholar.org/1f4c/f080ee622aa02623b35eda947fbc169b199d.pdf

This is why I have also designed and implemented my Parallel Conjugate Gradient Linear System Solver library that scales very well; here it is:

My Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well, version 1.76, is here..

Author: Amine Moulay Ramdane

Description:

This library contains a parallel implementation of a Conjugate Gradient dense linear system solver that is NUMA-aware and cache-aware and that scales very well, and it also contains a parallel implementation of a Conjugate Gradient sparse linear system solver that is cache-aware and that scales very well.

Sparse linear system solvers are ubiquitous in high performance computing (HPC) and are often the most computationally intensive parts of scientific computing codes. A few of the many applications relying on sparse linear solvers include fusion energy simulation, space weather simulation, climate modeling, environmental modeling, the finite element method, and large-scale reservoir simulations to enhance oil recovery by the oil and gas industry.

Conjugate Gradient is known to converge to the exact solution in at most n steps for a matrix of size n (in exact arithmetic), and was historically first seen as a direct method because of this. However, after a while people figured out that it works really well if you just stop the iteration much earlier: often you will get a very good approximation after far fewer than n steps. In fact, we can analyze how fast Conjugate Gradient converges. The end result is that Conjugate Gradient is used as an iterative method for large linear systems today.
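
The convergence analysis mentioned above is usually summarized by the classical error bound below, stated here for a symmetric positive-definite matrix A. The symbols are notation introduced for this note: x_* is the exact solution, x_k the k-th iterate, κ(A) the condition number, and the norm is the energy norm induced by A.

```latex
\[
  \| x_k - x_* \|_A \;\le\; 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^{k} \| x_0 - x_* \|_A ,
  \qquad \| v \|_A = \sqrt{v^{\mathsf T} A v} .
\]
```

So the closer κ(A) is to 1, the smaller the factor in parentheses and the fewer iterations are needed to reach a given tolerance.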

Please download the zip file and read the readme file inside the zip to know how to use it.

You can download it from:

https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

Language: GNU C++ and Visual C++ and C++Builder

Operating Systems: Windows, Linux, Unix and Mac OS X on (x86)

--

As you have noticed, I have just written above about my Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well; here are my Parallel Delphi and FreePascal Conjugate Gradient Linear System Solver libraries, which also scale very well:

Parallel implementation of Conjugate Gradient Dense Linear System solver library that is NUMA-aware and cache-aware that scales very well

https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-dense-linear-system-solver-library-that-is-numa-aware-and-cache-aware

Parallel implementation of Conjugate Gradient Sparse Linear System solver library that scales very well

https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-sparse-linear-system-solver

More of my philosophy about Unix and Linux and more..

I invite you to look at the following interesting video:

Unix vs Linux

https://www.youtube.com/watch?v=jowCUo_UGts

My diploma is a university-level diploma. The school in Morocco where I studied and obtained my university-level diploma in Microelectronics and Informatics was under the control of the Paris Academy in France (we call it Académie de Paris); here it is:

https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Ffr.wikipedia.org%2Fwiki%2FAcad%25C3%25A9mie_de_Paris

I started my studies in Microelectronics and Informatics in 1986, and during my informatics studies at my university-level school I programmed and worked with the following computer, called the Altos 586, which was a Unix system; here it is:

https://en.wikipedia.org/wiki/Altos_586

And I obtained my university-level diploma in Microelectronics and Informatics in 1989.

So as you notice, I know how to program on Unix, on Linux and on Windows too, and as proof here is my new scalable algorithm invention that I have also ported to Windows and Linux (you can download the zip file from my website and take a look at its source code):

https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

And you can look at my other open source software projects here on my website:

https://sites.google.com/site/scalable68/

And my new software invention of today is the following:

You have to know that a Turing-complete system can, in principle, express any computation that a computer program can perform, and the bash shell on Linux and Windows is Turing-complete. Even though bash is not Python, it is a minimalist language that is especially designed for administrators of operating systems. But I have noticed that bash is not well suited for parallel programming, and this is why I am enhancing it with my scalable algorithms, so that it supports sophisticated parallel programming on both Linux and Windows and scales much better on RAID arrays and on multicores. I am also writing a book about my enhancements to bash with my scalable algorithms, to help others be efficient in bash programming and in operating system administration, and of course I will sell my book. So I don't think you need Python, since Python doesn't come with my scalable algorithms that will enhance bash for Linux and Windows, and I think operating system administrators don't need Python, since it is not a minimalist language like bash for Linux and Windows.

You can read more about bash shell from here:

https://www.infoworld.com/article/2893519/perl-python-ruby-are-nice-bash-is-where-its-at.html

More philosophy about my kind of interests and more about me..

More philosophy about what kind of friends I have..

I look like a PhD researcher since I am an inventor and I have invented many scalable algorithms and algorithms, and I am still inventing algorithms, and this is why my friends are like PhD researchers. Here is one of my friends who is a PhD researcher and Full Professor; he is one of my best friends and I have known him for around 23 years:

https://www.usherbrooke.ca/gelecinfo/fr/departement/profs/khoa-fr/khoa-en/

And notice on his webpage that he is a full professor who teaches a course in operational research (which uses sophisticated mathematics), and it is called the following:

"Performance analysis, probability and queuing, GIF 360"

I have discussed operational research with him a lot, since I have also studied operational research, and here are some of my operational research software projects:

About capacity planning and queuing theory..

I am a white arab and I think I am smart since I have also invented many scalable algorithms and algorithms. I have a university-level diploma in Microelectronics and Informatics and I am a software developer, but I have also studied operational research. Read my following thoughts about operational research and some of my software projects in it:

I have bought the following book called Performance by Design: Computer Capacity Planning By Example here:

https://www.amazon.ca/Performance-Design-Computer-Capacity-Planning/dp/0130906735

The book analyzes the performance of an E-Business service with queuing theory, but I think its methodology is error-prone because it involves many mathematical calculations. This is why I have decided to construct another methodology that is much less error-prone, that is easier, and that uses a Jackson network. My methodology works when 65% or more of the database transactions are reads, with write and delete transactions together at 25% or less, and it also works when 65% or more of the database transactions are writes. So I think my methodology is suitable for capacity planning of E-Business services with mathematical queuing theory. I will write a book that explains my methodology, and of course I am taking the HTTP or HTTPS overhead into account, and I will provide you with a program too.

And here is my PDQ for Delphi and Freepascal

This is a port by Amine Moulay Ramdane of PDQ version 6.2.0 to Delphi on Windows and to FreePascal on both Windows and Linux. I have also provided two demos, an M/M/1 queuing demo and a Jackson network demo, as well as my HTML tutorial on how to solve analytically the Jackson network problem that is provided as a PDQ demo.

You can download it from my website here:

https://sites.google.com/site/scalable68/pdq-for-delphi-and-freepascal

PDQ is an analytic queueing-circuit analyzer made freely available under MIT/X11 license from www.perfdynamics.com

Read more about PDQ here:

http://www.perfdynamics.com/Tools/PDQ.html
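
To show what solving an open Jackson network analytically involves, here is a minimal sketch in Python. It does not use PDQ and it is not my tutorial's code: the function names are hypothetical, every node is assumed to be a single M/M/1 server, and the traffic equations are solved by simple fixed-point iteration, which is enough for a small network.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Utilization, mean number in system and mean response time of M/M/1."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    return rho, rho / (1.0 - rho), 1.0 / (service_rate - arrival_rate)


def solve_traffic_equations(external, routing, iterations=1000):
    """Solve lambda_i = external_i + sum_j lambda_j * routing[j][i]."""
    n = len(external)
    lam = list(external)
    for _ in range(iterations):
        lam = [external[i] + sum(lam[j] * routing[j][i] for j in range(n))
               for i in range(n)]
    return lam


def jackson_network(external, routing, service_rates):
    """Per-node arrival rates, per-node M/M/1 metrics and mean response time."""
    lam = solve_traffic_equations(external, routing)
    per_node = [mm1_metrics(l, mu) for l, mu in zip(lam, service_rates)]
    total_jobs = sum(m[1] for m in per_node)
    # Little's law on the whole network: N = X * R, with X the external rate.
    response_time = total_jobs / sum(external)
    return lam, per_node, response_time
```

As a usage example with made-up numbers: with 10 requests per second entering a web server (node 0), half of them going on to a database (node 1) that always returns to the web server, call `jackson_network([10.0, 0.0], [[0.0, 0.5], [1.0, 0.0]], [40.0, 25.0])`. The traffic equations then give 20 and 10 requests per second at the two nodes.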

And I have also implemented an M/M/n queuing model simulation in Object Pascal; here it is:

https://sites.google.com/site/scalable68/m-m-n-queuing-model-simulation-with-object-pascal
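
A simulation of the M/M/n model can be cross-checked against the closed-form Erlang C result. The sketch below is in Python, not Object Pascal, and is not taken from my simulation; the function names are chosen for illustration. It computes the probability that an arriving job has to wait and the mean waiting time in the queue.

```python
from math import factorial


def erlang_c(arrival_rate, service_rate, servers):
    """Probability that an arriving job must wait in an M/M/n queue."""
    a = arrival_rate / service_rate          # offered load in Erlangs
    rho = a / servers                        # utilization per server
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    top = a ** servers / factorial(servers) / (1.0 - rho)
    bottom = sum(a ** k / factorial(k) for k in range(servers)) + top
    return top / bottom


def mmn_mean_wait(arrival_rate, service_rate, servers):
    """Mean time spent waiting in the queue, excluding service."""
    return (erlang_c(arrival_rate, service_rate, servers)
            / (servers * service_rate - arrival_rate))
```

With one server the formula reduces to the M/M/1 case, where the waiting probability equals the utilization, which gives a quick sanity check for both the formula and a simulation.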

I have also implemented a Maxflow algorithm for Delphi and FreePascal; here it is:

https://sites.google.com/site/scalable68/maxflow-algorithm-for-delphi-and-freepascal

More philosophy about AMD Ryzen Threadripper PRO and Nvidia V100 PCIe (Volta)..

And I invite you to look at the following spec of the AMD Ryzen Threadripper PRO 3975WX 32-Core CPU that I will buy next month:

https://www.techpowerup.com/cpu-specs/ryzen-threadripper-pro-3975wx.c2315

And look carefully at the following benchmark:

https://www.xcelerit.com/computing-benchmarks/insights/benchmarks-intel-xeon-scalable-processor-vs-nvidia-v100-gpu/

So as you can see, the Nvidia V100 PCIe (Volta) 16 GB is rated at 7,014 GFLOPS (double precision), and the AMD Ryzen Threadripper PRO 3975WX 32-Core CPU at around 6,451.2 GFLOPS, but look carefully at the price of the Nvidia V100 PCIe (Volta) 16 GB, which is 7124 US dollars:

https://www.amazon.ca/PNY-TCSV100MPCIE-PB-Nvidia-Tesla-v100/dp/B076P84525/ref=pd_di_sccai_3?pd_rd_w=AmXIj&pf_rd_p=e92f388e-b766-4f7f-aac1-ee1d0056e8fb&pf_rd_r=77B7DWXEVBM5VSXT4NZG&pd_rd_r=27e26b6a-c0a6-4558-8a68-97e286ba6213&pd_rd_wg=HxaUi&pd_rd_i=B076P84525&psc=1

And look at the price of the AMD Ryzen Threadripper PRO 3975WX 32-Core CPU, which is 2790 US dollars:

https://www.newegg.ca/amd-ryzen-threadripper-pro-3975wx/p/N82E16819113677

So I think that the AMD Ryzen Threadripper PRO 3975WX 32-Core CPU is competitive with the Nvidia V100 PCIe (Volta) 16 GB in GFLOPS per dollar.
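
The comparison above is simple arithmetic, and a few lines of Python make it explicit. The prices and GFLOPS figures below are exactly the ones quoted in this post; they are not independently verified, and the two GFLOPS ratings may not have been measured in the same way, so treat the result as a rough sketch.

```python
def dollars_per_gflops(price_usd, gflops):
    return price_usd / gflops


# (price in US dollars, GFLOPS), as quoted in the post; not verified.
QUOTED = {
    "Nvidia V100 PCIe (Volta) 16 GB": (7124.0, 7014.0),
    "AMD Ryzen Threadripper PRO 3975WX": (2790.0, 6451.2),
}


def compare(parts=QUOTED):
    """Map each part name to its price per GFLOPS."""
    return {name: dollars_per_gflops(price, gflops)
            for name, (price, gflops) in parts.items()}
```

With these quoted figures the GPU comes out at roughly one dollar per GFLOPS and the CPU at less than half of that.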

I have just read the following interesting article about AVX-512:

On the dangers of Intel's frequency scaling

https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

So as you will notice by reading the above article, you should avoid AVX-512, because it heats the CPU cores a lot, so what Intel does is reduce the clock speed of the CPU cores considerably, and this is not good for performance. So what I advise is to avoid AVX2 and AVX-512 and to choose AVX, which does not have this problem. And the AMD Ryzen Threadripper PRO 3975WX 32-Core that I will buy next month also supports AVX2.

More about me and about fault-tolerant computer systems and more..

I came to Canada when I was 20 years old, I have been living in Quebec, Canada for 32 years, and I am now 52 years old. But I am genetically an athletic guy and I feel that I am still young, because I am athletic and I am 6 feet tall, and I am beautiful from the inside since I am a gentleman type of person, which is also genetic in me. I have worked as a software consultant with some hospitals in the USA, and I have worked with some computer hardware companies and software companies in British Columbia and in New Brunswick in Canada. Here is more about my education, my diploma and more:

My name is Amine Moulay Ramdane, I am a white arab from Morocco, and I think I am smart since I have also invented many scalable algorithms and algorithms, and I am a gentleman type of person. I have lived in Quebec, Canada since 1989, so I am also a Canadian from Morocco. You have seen me writing my thoughts about my political philosophy here, and now I will talk about my education and my diploma: my diploma is a university-level diploma, and the school in Morocco where I studied and obtained my university-level diploma in Microelectronics and Informatics was under the control of the Paris Academy in France (we call it Académie de Paris); here it is:

https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Ffr.wikipedia.org%2Fwiki%2FAcad%25C3%25A9mie_de_Paris

I then studied one more year of applied mathematics at the University of Montreal in Quebec, Canada, and I passed that year, so with my diploma and this one year of applied mathematics I have completed 3 years at the university level. After that I studied network administration, and I have also worked as a network administrator and as a software developer consultant; the name of my company was and is Cyber-NT Communications, in Quebec, Canada. Around 2001 and 2002 I started to implement some of my software, such as PerlZip, which looked like PKZIP from the PKWARE software company but which I implemented for Perl. I implemented the dynamic link libraries of my PerlZip, which compress and decompress etc., with the Delphi compiler, so my PerlZip software product was very fast and very efficient. In 2002 I posted the beta version on the internet, and as proof, please read about it here:

http://computer-programming-forum.com/52-perl-modules/ea157f4a229fc720.htm

After that I sold the release version of my PerlZip product to many companies and to many individuals around the world, and I even sold it to many banks in Europe, and with that I made more money.

After that I continued to work as a software developer consultant and as a network administrator through my company in Quebec, Canada, called Cyber-NT Communications; read the proof here:

https://opencorporates.com/companies/ca_qc/2246777231

Also read the following part of a somewhat old O'Reilly book called Perl for System Administration by David N. Blank-Edelman, and you will notice that it contains my name and speaks about some of my Perl modules:

https://www.oreilly.com/library/view/perl-for-system/1565926099/ch04s04.html

And you can find my Open source software projects here in my website:

https://sites.google.com/site/scalable68/

More philosophy about HP NonStop to x86 Server Platform fault-tolerant computer systems and more..

Now HP to Extend HP NonStop to x86 Server Platform

HP announced in 2013 plans to extend its mission-critical HP NonStop technology to x86 server architecture, providing the 24/7 availability required in an always-on, globally connected world, and increasing customer choice.

Read the following to notice it:

https://www8.hp.com/us/en/hp-news/press-release.html?id=1519347#.YHSXT-hKiM8

And today HP provides HP NonStop on the x86 server platform; here is an example, read here:

https://www.hpe.com/ca/en/pdfViewer.html?docId=4aa5-7443&parentPage=/ca/en/products/servers/mission-critical-servers/integrity-nonstop-systems&resourceTitle=HPE+NonStop+X+NS7+%E2%80%93+Redefining+continuous+availability+and+scalability+for+x86+data+sheet

So I think that programming HP NonStop for x86 is compatible with programming for the x86 CPU architecture, so my following methodology works correctly; read it carefully, since I have just extended my thoughts:

Here is my next powerful computer..

Next month I will buy a powerful computer with the following powerful CPU:

AMD Ryzen Threadripper PRO 3975WX 32-Core 3.5 GHz

https://www.newegg.ca/amd-ryzen-threadripper-pro-3975wx/p/N82E16819113677

The computer that I will buy next month will cost me around 9 thousand dollars, since I want to do some testing with the above CPU, which comes with 32 cores and 8 memory channels. I have invented many scalable algorithms and algorithms, I am writing two books about parallelism and concurrency that I will sell, and I have invented some powerful tools for parallelism and concurrency that I will sell too etc.

So as you can see, I am also buying a 3,499 US dollar CPU from the USA to help the USA economy work better.

Here are some benchmarks of a less powerful AMD Threadripper 3970X CPU with 4 memory channels:

https://www.pugetsystems.com/labs/hpc/HPC-Parallel-Performance-for-3rd-gen-Threadripper-Xeon-3265W-and-EPYC-7742-HPL-HPCG-Numpy-NAMD-1717/

Also, my next AMD Ryzen Threadripper PRO 3975WX 32-Core 3.5 GHz can be configured to work as 4 NUMA nodes, and access to far memory will be about 1.6x slower than access to near memory. As you can see, my scalable algorithms such as my scalable MLock will still work correctly, since what is important is scalability, even with this 1.6x penalty for far memory.

About smartness and about MCS Lock and more..

I have just read the following article from ACM:

Scalability Techniques for Practical Synchronization Primitives

https://queue.acm.org/detail.cfm?id=2698990

Notice how they speak about one of the best scalable locks, which we call the MCS lock. But I think that the CLH and MCS locks are not smart, since they are intrusive: they need an extra parameter (the queue node), which has to be either passed by the caller or hidden. This is why I think I am smart, since I have invented a scalable lock that is better than the MCS lock: my scalable lock doesn't require any parameter to be passed; you just call the Enter() and Leave() methods and that's all. Read carefully about it on my website here:

https://sites.google.com/site/scalable68/scalable-mlock

I have also just enhanced it further and I will post it soon.

I have also invented many other scalable algorithms and algorithms..

Here are some of them:

https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

https://sites.google.com/site/scalable68/scalable-rwlock

https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

https://groups.google.com/forum/#!topic/comp.programming.threads/VaOo1WVACgs

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well

Thank you,

Amine Moulay Ramdane.

Here is my just new invention of a scalable algorithm and my other new inventions..

I have just read the following PhD paper about the invention called counting networks, which are better than software combining trees:

Counting Networks

http://people.csail.mit.edu/shanir/publications/AHS.pdf

And i have read the following PhD paper:

http://people.csail.mit.edu/shanir/publications/HLS.pdf

So as you can see, they say in the conclusion that:

"Software combining trees and counting networks which are the only techniques we observed to be truly scalable"

But I have just found that this counting networks algorithm is not generally scalable, and I have the logical proof. This is why I have just come up with a new invention that enhances the counting networks algorithm to be generally scalable. So you have to be careful with the current counting networks algorithm, which is not generally scalable.
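
For readers who have not seen a counting network, here is a minimal sequential sketch in Python of the width-4 bitonic counting network described in the cited paper. It is not my enhanced algorithm and it is not thread-safe: in a real implementation each balancer toggle would be updated atomically and each output wire would have its own counter. The class names are chosen for illustration.

```python
class Balancer:
    """Toggle that sends tokens alternately to its top and bottom wire."""

    def __init__(self, top, bottom):
        self.wires = (top, bottom)
        self.toggle = 0

    def traverse(self):
        out = self.wires[self.toggle]
        self.toggle ^= 1
        return out


class Bitonic4:
    """Width-4 bitonic counting network: three layers of two balancers."""

    LAYERS = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]

    def __init__(self):
        self.layers = []
        for layer in self.LAYERS:
            by_wire = {}
            for top, bottom in layer:
                balancer = Balancer(top, bottom)
                by_wire[top] = balancer
                by_wire[bottom] = balancer
            self.layers.append(by_wire)
        self.exits = [0] * 4  # tokens that have left on each output wire

    def next_value(self, input_wire):
        """Route one token through the network and return its counter value."""
        wire = input_wire
        for by_wire in self.layers:
            wire = by_wire[wire].traverse()
        value = wire + 4 * self.exits[wire]
        self.exits[wire] += 1
        return value
```

Whatever input wires the tokens enter on, the outputs are spread evenly over the four exit wires, so the values handed out are 0, 1, 2, 3, 4 and so on, without all threads contending on one shared counter.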

More philosophy about my kind of works..

I have just written the following:

--

More philosophy about my way of doing..

You have to know me more: I have just posted about Computer Science vs Software Engineering, but I am not like a computer scientist or a software engineer, because I am an inventor of many scalable software algorithms and algorithms, and I have invented some powerful software tools. My way of doing things is to be innovative, creative and inventive, so I am like a PhD researcher, and I am writing some books about my inventions and about my powerful tools etc.

--

I will give an example of how I am inventive and creative: I have just read the following book (and other books like it) by a PhD researcher about operational research and capacity planning; here it is:

Performance by Design: Computer Capacity Planning by Example

https://www.amazon.ca/Performance-Design-Computer-Capacity-Planning/dp/0130906735

So I have just found that the methodologies of those PhD researchers for the E-Business service don't work, because they do their calculations for a given arrival rate that is statistically and empirically measured from the behavior of customers, and I think that this is not correct. So I am being inventive and I have come up with my new methodology, which fixes the arrival rate from the data by using a hyperexponential service distribution (and it is mathematical), since it is also good for Denial-of-Service (DoS) attacks. I will write a powerful book that teaches my new methodology and explains the mathematics behind it, and I will sell it. My new methodology will work for cloud computing and for computer servers.

More about my inventions of scalable algorithms..

More precision about my new inventions of scalable algorithms..

And look at my powerful inventions below, LW_Fast_RWLockX and Fast_RWLockX, which are two powerful scalable RWLocks that are FIFO fair, starvation-free and costless on the reader side (that means with no atomics and no fences on the reader side). They use the expedited sys_membarrier on Linux and FlushProcessWriteBuffers() on Windows. If you look at the source code of my LW_Fast_RWLockX.pas and Fast_RWLockX.pas inside the zip file, you will notice that on Linux they call two functions, membarrier1() and membarrier2(): membarrier1() registers the process's intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED, and membarrier2() executes a memory barrier on each running thread belonging to the same process as the calling thread.

Read more here to understand:

https://man7.org/linux/man-pages/man2/membarrier.2.html

Here are my new powerful inventions of scalable algorithms..

I have just updated my powerful inventions LW_Fast_RWLockX and Fast_RWLockX, the two scalable RWLocks that are FIFO fair, starvation-free and costless on the reader side (that means with no atomics and no fences on the reader side). They use the expedited sys_membarrier on Linux and FlushProcessWriteBuffers() on Windows, and now they work on both Linux and Windows. I think my inventions are really smart, since the following PhD researcher says:

"Until today, there is no known efficient reader-writer lock with starvation-freedom guarantees;"

Read more here:

http://concurrencyfreaks.blogspot.com/2019/04/onefile-and-tail-latency.html

So I think that my above powerful inventions of scalable reader-writer locks are efficient, FIFO fair and starvation-free.

LW_Fast_RWLockX is a lightweight scalable reader-writer mutex that uses a technique that looks like a Seqlock but without the Seqlock's looping on the reader side, and this has permitted the reader side to be costless; it is fair, it is of course starvation-free, and it spin-waits. Fast_RWLockX is a lightweight scalable reader-writer mutex that uses the same technique, so its reader side is also costless and it is also fair and starvation-free, but it does not spin-wait: it waits on my SemaMonitor, so it is energy efficient.

You can read about them and download them from my website here:

https://sites.google.com/site/scalable68/scalable-rwlock

About the Linux sys_membarrier() expedited and the windows FlushProcessWriteBuffers()..

I have just read the following webpage:

https://lwn.net/Articles/636878/

And it is interesting and it says:

---

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader: 1701557485 reads, 3129842 writes

signal-based scheme: 9825306874 reads, 5386 writes

sys_membarrier expedited: 6637539697 reads, 852129 writes

sys_membarrier non-expedited: 7992076602 reads, 220 writes

---

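Look at how powerful "sys_membarrier expedited" is: its readers complete about 3.9 times more reads than with memory barriers in the reader, while its writers stay far ahead of the signal-based and non-expedited schemes.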

Cache-coherency protocols do not use IPIs, and as a user-space level developer you do not care about IPIs at all. One is most interested in the cost of cache-coherency itself. However, Win32 API provides a function that issues IPIs to all processors (in the affinity mask of the current process) FlushProcessWriteBuffers(). You can use it to investigate the cost of IPIs.

When I ran a simple synthetic test on a dual-core machine, I obtained the following numbers.

420 cycles is the minimum cost of the FlushProcessWriteBuffers() function on the issuing core.

1600 cycles is the mean cost of the FlushProcessWriteBuffers() function on the issuing core.

1300 cycles is the mean cost of the FlushProcessWriteBuffers() function on the remote core.

Note that, as far as I understand, the function issues IPI to remote core, then remote core acks it with another IPI, issuing core waits for ack IPI and then returns.

And the IPIs have indirect cost of flushing the processor pipeline.

More about WaitAny() and WaitAll() and more..

Look at the following concurrency abstractions of Microsoft:

https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitany?view=netframework-4.8

https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitall?view=netframework-4.8

They look like the following WaitForAny() and WaitForAll() of Delphi, here they are:

http://docwiki.embarcadero.com/Libraries/Sydney/en/System.Threading.TTask.WaitForAny

http://docwiki.embarcadero.com/Libraries/Sydney/en/System.Threading.TTask.WaitForAll

So WaitForAll() is easy, and I have implemented it in my Threadpool engine that scales very well and that I have invented. You can read my HTML tutorial inside its zip file to know how to do it, and you can download it from my website here:

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well

As for WaitForAny(), you can also do it using my SemaMonitor, and I will soon give you an example of how to do it. You can download my SemaMonitor invention from my website here:

https://sites.google.com/site/scalable68/semacondvar-semamonitor
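
To illustrate the semantics of the two abstractions discussed above, here is a minimal sketch using Python's standard concurrent.futures module. It does not use my Threadpool engine or my SemaMonitor, and the job delays and names are made up for illustration.

```python
import time
from concurrent.futures import (ALL_COMPLETED, FIRST_COMPLETED,
                                ThreadPoolExecutor, wait)


def job(delay, label):
    time.sleep(delay)
    return label


def wait_any_then_all():
    """Block until any task finishes, then block until all have finished."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(job, delay, label)
                   for delay, label in [(0.3, "slow"), (0.05, "fast"), (0.2, "medium")]]
        # WaitAny semantics: returns as soon as at least one task is done.
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        first = sorted(f.result() for f in done)
        # WaitAll semantics: returns only when every task is done.
        done_all, pending_all = wait(futures, return_when=ALL_COMPLETED)
        everything = sorted(f.result() for f in done_all)
    return first, everything, len(pending_all)
```

In both calls the waiting thread blocks instead of polling, so it consumes no CPU cycles while it waits.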

Here are my other new software inventions..

I have just looked at the source code of the following multiplatform pevents

https://github.com/neosmart/pevents

And notice that its WaitForMultipleEvents() is implemented with pthreads, but it is not scalable on multicores. So I have just invented a WaitForMultipleObjects() that looks like the Windows WaitForMultipleObjects(), that is fully "scalable" on multicores, that works on Windows, Linux and MacOSX, and that blocks while waiting for the objects just as WaitForMultipleObjects() does, so it doesn't consume CPU cycles when waiting. It works with events, futures and tasks.

Here are my other new software inventions..

I have just invented a latch that is fully "scalable" on multicores and a thread barrier that is fully scalable on multicores; they are really powerful.

Read about the C++ latches and thread barriers, which are not scalable on multicores, here:

https://www.modernescpp.com/index.php/latches-and-barriers
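
For readers who are not familiar with these two primitives, here is a minimal sketch in Python of their classic centralized form, which is the kind that is not scalable on multicores, since every thread goes through one shared counter. It is not my scalable latch or barrier; the `Latch` class is a hypothetical helper written for this illustration, and `threading.Barrier` is Python's standard barrier.

```python
import threading


class Latch:
    """Single-use countdown latch built on one condition variable."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            self._count -= 1
            if self._count <= 0:
                self._cond.notify_all()

    def wait(self):
        with self._cond:
            self._cond.wait_for(lambda: self._count <= 0)


def run_phases(n_workers=4):
    latch = Latch(n_workers)
    barrier = threading.Barrier(n_workers)
    log = []
    log_lock = threading.Lock()

    def worker(i):
        with log_lock:
            log.append(("phase1", i))
        barrier.wait()       # nobody starts phase 2 before all finish phase 1
        with log_lock:
            log.append(("phase2", i))
        latch.count_down()   # tell the main thread this worker is done

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    latch.wait()             # main thread blocks until every worker counted down
    for t in threads:
        t.join()
    return log
```

The barrier is reusable and synchronizes the workers with each other, while the latch is single-use and lets one thread wait for the others.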

Here are my other software inventions:

More about my scalable math Linear System Solver Library...

As you have just noticed, I have just spoken about my Linear System Solver Library (read below). Right now it scales very well, but I will soon make it "fully" scalable on multicores using one of the scalable algorithms that I have invented, and I will extend it much more to also support efficient matrix operations that are scalable on multicores, and more. Since it will come with one of my scalable algorithms, I think I will sell it too.

More about mathematics and about scalable Linear System Solver Libraries and more..

I have just noticed that a software architect from Austria called Michael Rabatscher has designed and implemented the MrMath Library, which is also a parallelized library:

Here he is:

https://at.linkedin.com/in/michael-rabatscher-6821702b

And here is his MrMath Library for Delphi and Freepascal:

https://github.com/mikerabat/mrmath

But I think that he is not so smart, and I think I am smart like a genius, and I say that his MrMath Library is not scalable on multicores: neither its Linear System Solver nor its threaded matrix operations are scalable on multicores. This is why I have invented a Conjugate Gradient Linear System Solver Library for C++, Delphi and Freepascal that is scalable on multicores. Read about it in my following thoughts (I will also soon extend my library to support scalable matrix operations):

About SOR and Conjugate gradient mathematical methods..

I have just looked at SOR (the Successive Overrelaxation method), and I think it is much less powerful than the Conjugate Gradient method. Read the following to notice it:

COMPARATIVE PERFORMANCE OF THE CONJUGATE GRADIENT AND SOR METHODS FOR COMPUTATIONAL THERMAL HYDRAULICS

https://inis.iaea.org/collection/NCLCollectionStore/_Public/19/055/19055644.pdf?r=1&r=1

This is why I have implemented my Parallel Conjugate Gradient Linear System Solver Library, which scales very well, in both C++ and Delphi. Read my following thoughts about it to understand more:

About the convergence properties of the conjugate gradient method

The conjugate gradient method can theoretically be viewed as a direct method, as it produces the exact solution after a finite number of iterations, which is not larger than the size of the matrix, in the absence of round-off error. However, the conjugate gradient method is unstable with respect to even small perturbations, e.g., most directions are not in practice conjugate, and the exact solution is never obtained. Fortunately, the conjugate gradient method can be used as an iterative method as it provides monotonically improving approximations to the exact solution, which may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. The improvement is typically linear and its speed is determined by the condition number κ(A) of the system matrix A: the larger is κ(A), the slower the improvement.
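To make the iterative view concrete, here is a minimal Python sketch of the textbook conjugate gradient iteration for a symmetric positive-definite matrix. It is not the code of my parallel library: it is sequential, and the function names and the small 2x2 test system are assumptions chosen only for illustration. It records the residual norm at every step so that you can watch the approximation improve.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    """Dense matrix-vector product, A given as a list of rows."""
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A.

    Returns (x, residuals), where residuals[i] is the norm of b - A x
    after i iterations.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the start x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    residuals = [rs_old ** 0.5]
    for _ in range(max_iter):
        if residuals[-1] <= tol:
            break            # required tolerance reached, stop early
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                       # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        residuals.append(rs_new ** 0.5)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]      # next direction
        rs_old = rs_new
    return x, residuals

# Small illustrative system: 4x + y = 1, x + 3y = 2.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, residuals = conjugate_gradient(A, b)
```

For this 2x2 system the exact solution is x = 1/11, y = 7/11, and the iteration needs no more steps than the size of the matrix, which is the "direct method" behaviour described above. On large systems you would stop as soon as the residual is below the tolerance.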

Read more here:

http://pages.stat.wisc.edu/~wahba/stat860public/pdf1/cj.pdf

So I think my Conjugate Gradient Linear System Solver Library, which scales very well, is still very useful. Read about it in my writing below:

Read the following interesting news:

The finite element method finds its place in games

Read more here:

https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Fhpc.developpez.com%2Factu%2F288260%2FLa-methode-des-elements-finis-trouve-sa-place-dans-les-jeux-AMD-propose-la-bibliotheque-FEMFX-pour-une-simulation-en-temps-reel-des-deformations%2F

But you have to be aware that finite element solvers often use the Conjugate Gradient method to solve the linear systems that finite element problems produce. Read here to notice it:

Conjugate Gradient Method for Solution of Large Finite Element Problems on CPU and GPU

https://pdfs.semanticscholar.org/1f4c/f080ee622aa02623b35eda947fbc169b199d.pdf

This is why I have also designed and implemented my Parallel Conjugate Gradient Linear System Solver library, which scales very well. Here it is:

My Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well, version 1.76, is here..

Author: Amine Moulay Ramdane

Description:

This library contains a parallel implementation of a Conjugate Gradient Dense Linear System Solver that is NUMA-aware and cache-aware and that scales very well. It also contains a parallel implementation of a Conjugate Gradient Sparse Linear System Solver that is cache-aware and that scales very well.

Sparse linear system solvers are ubiquitous in high performance computing (HPC) and are often the most computationally intensive parts of scientific computing codes. A few of the many applications relying on sparse linear solvers include fusion energy simulation, space weather simulation, climate modeling, environmental modeling, the finite element method, and large-scale reservoir simulations to enhance oil recovery by the oil and gas industry.

Conjugate Gradient is known to converge to the exact solution in at most n steps for a symmetric positive-definite matrix of size n (in exact arithmetic), and because of this it was historically first seen as a direct method. However, after a while people figured out that it works really well if you just stop the iteration much earlier: often you will get a very good approximation after far fewer than n steps. In fact, we can analyze how fast Conjugate Gradient converges. The end result is that Conjugate Gradient is used today as an iterative method for large linear systems.

Please download the zip file and read the readme file inside the zip to know how to use it.

You can download it from:

https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

Languages: GNU C++, Visual C++ and C++Builder

Operating Systems: Windows, Linux, Unix and Mac OS X on x86

--

Thread Barrier for Delphi and Freepascal version 1.0 is here..

I have added my condition variable implementation and my scalable lock, called MLock, which both work on Windows and Linux, and I have made the Thread Barrier work on both Windows and Linux. You can now pass a parameter to the constructor of the Thread Barrier: ctMutex to use a Mutex, ctMLock to use the scalable MLock, or ctCriticalSection to use a Critical Section.

You can download it from my website here:

https://sites.google.com/site/scalable68/thread-barrier-for-delphi-and-freepascal

More precision about my inventions, SemaMonitor, SemaCondvar and my Monitor..

My SemaMonitor and SemaCondvar inventions are fast-pathed when their count is greater than 0: in this case the wait() method stays in user mode and doesn't switch from user mode to kernel mode, a switch that costs around 1500 CPU cycles and is expensive. The signal() method is also fast-pathed when there is no item in the queue and the count is less than MaximumCount. Read here about the cost (in CPU cycles) of switching between Windows user mode and kernel mode:

https://stackoverflow.com/questions/1368061/whats-the-cost-in-cycles-to-switch-between-windows-kernel-and-user-mode#:~:text=1%20Answer&text=Switching%20from%20%E2%80%9Cuser%20mode%E2%80%9D%20to,rest%20is%20%22kernel%20overhead%22.

You can read about my SemaMonitor and SemaCondvar inventions and download them from here:

https://sites.google.com/site/scalable68/semacondvar-semamonitor

And the lightweight version is here:

https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor

I have also implemented an efficient Monitor over my SemaCondvar.

Here is the description of my efficient Monitor inside the Monitor.pas file that you will find inside the zip file:

Description:

This is my implementation of a Monitor over my SemaCondvar.

You will find the Monitor class inside the Monitor.pas file inside the zip file.

When you set the first parameter of the constructor to true, the signal will not be lost if no threads are waiting with the wait() method. When you set it to false and no threads are waiting with the wait() method, the signal will be lost.

The second parameter of the constructor is the kind of lock: you can set it to ctMLock to use my scalable node-based lock called MLock, to ctMutex to use a Mutex, or to ctCriticalSection to use the TCriticalSection.

Here are the methods of my efficient Monitor that I have implemented:

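TMonitor = class
private
  cache0: typecache0;
  lock1: TSyncLock;
  obj: TSemaCondvar;
  cache1: typecache0;
public
  constructor Create(bool: boolean = true; lock: TMyLocks = ctMLock);
  destructor Destroy; override;
  // Enter and leave the monitor's lock
  procedure Enter();
  procedure Leave();
  // Signal one waiting thread; returns false on failure
  function Signal(): boolean; overload;
  // Signal nbr waiting threads; undelivered signals are returned in remains
  function Signal(nbr: long; var remains: long): boolean; overload;
  // Signal all waiting threads
  procedure Signal_All();
  // Wait for a signal, with an optional timeout in milliseconds
  function Wait(const AMilliseconds: longword = INFINITE): boolean;
  // Number of threads currently waiting
  function WaitersBlocked(): long;
end;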

The wait() method makes a thread wait on the Monitor object until it is signaled. If wait() fails, it may be because the number of waiters is greater than high(longword).

The signal() method signals one waiting thread on the Monitor object; if signal() fails, the returned value is false.

The signal_all() method signals all the threads waiting on the Monitor object.

The signal(nbr:long;var remains:long) method signals nbr waiting threads; if it fails, the number of signals that were not delivered is returned in the remains variable.

WaitersBlocked() returns the number of threads waiting on the Monitor object.

The Enter() and Leave() methods enter and leave the monitor's lock.
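The difference between a signal that is kept and a signal that is lost can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not a translation of Monitor.pas: the class name SignalMonitor, the keep_signals parameter and the choice to return False for a lost signal are assumptions of the sketch.

```python
import threading

class SignalMonitor:
    """Condition-like object whose signals can optionally be remembered.

    Hypothetical sketch: keep_signals plays the role of the first
    constructor parameter described in the text.
    """
    def __init__(self, keep_signals=True):
        self._cond = threading.Condition()
        self._keep = keep_signals
        self._pending = 0   # signals sent but not yet consumed
        self._waiters = 0   # threads currently inside wait()

    def signal(self):
        """Signal one waiter. Returns False if the signal was lost."""
        with self._cond:
            if self._keep or self._waiters > self._pending:
                self._pending += 1
                self._cond.notify()
                return True
            return False    # nobody to wake and signals are not kept

    def wait(self, timeout=None):
        """Block until a signal is available; False on timeout."""
        with self._cond:
            self._waiters += 1
            try:
                ok = self._cond.wait_for(lambda: self._pending > 0, timeout)
                if ok:
                    self._pending -= 1   # consume exactly one signal
                return ok
            finally:
                self._waiters -= 1

    def waiters_blocked(self):
        with self._cond:
            return self._waiters
```

With keep_signals=True a signal() sent before any wait() is consumed by the next wait(), which returns immediately. With keep_signals=False the same early signal() is dropped and the later wait() blocks until its timeout.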

You can download the zip files from:

https://sites.google.com/site/scalable68/semacondvar-semamonitor

and the lightweight version is here:

https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor

More about my powerful invention of a scalable reference counting algorithm and about my scalable algorithms..

I invite you to read the following web page:

Why is memory reclamation so important?

https://concurrencyfreaks.blogspot.com/search?q=resilience+and+urcu

Notice that it is saying the following about RCU:

"Reason number 4, resilience

Another reason to go with lock-free/wait-free data structures is because they are resilient to failures. On a shared memory system with multiples processes accessing the same data structure, even if one of the processes dies, the others will be able to progress in their work. This is the true gem of lock-free data structures: progress in the presence of failure. Blocking data structures (typically) do not have this property (there are exceptions though). If we add a blocking memory reclamation (like URCU) to a lock-free/wait-free data structure, we are loosing this resilience because one dead process will prevent further memory reclamation and eventually bring down the whole system.

There goes the resilience advantage out the window."

So I think that RCU cannot be used for reference counting: it is blocking on the writer side, so it is not lock-free on the writer side and therefore not resilient to failures.

This is why I have invented my powerful Scalable reference counting with efficient support for weak references, whose scalable reference counting is lock-free. Here it is:

https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

My scalable reference counting algorithm is of the SCU(0,1) class of algorithms, so under scheduling conditions which approximate those found in commercial hardware architectures it becomes wait-free, with a system latency of O(sqrt(k)) and an individual latency of O(k*sqrt(k)), where k is the number of threads.
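As a small worked example, the two bounds for SCU(0,1) are the special case q = 0, s = 1 of the general SCU(q, s) bounds of the paper, O(q + s*sqrt(k)) for the system latency and O(k(q + s*sqrt(k))) for the individual latency. The Python sketch below only evaluates these expressions with the constant factors dropped; the function name is hypothetical and the numbers are growth rates, not measured latencies.

```python
import math

def scu_latency_bounds(q, s, k):
    """Return (system, individual) latency bounds for SCU(q, s) on k threads.

    Constant factors hidden by the O() notation are dropped.
    """
    system = q + s * math.sqrt(k)
    individual = k * system
    return system, individual

# SCU(0,1) on 16 threads: sqrt(16) = 4 and 16 * 4 = 64.
system, individual = scu_latency_bounds(0, 1, 16)
```

So going from 16 to 64 threads doubles the system bound (4 to 8) but multiplies the individual bound by 8 (64 to 512).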

The proof is in the following PhD paper:

https://arxiv.org/pdf/1311.3200.pdf

The abstract of the paper says: "This paper suggests a simple solution to this problem. We show that, for a large class of lock-free algorithms, under scheduling conditions which approximate those found in commercial hardware architectures, lock-free algorithms behave as if they are wait-free. In other words, programmers can keep on designing simple lock-free algorithms instead of complex wait-free ones, and in practice, they will get wait-free progress."

In its analysis of the class SCU(q, s) it says:

"Given an algorithm in SCU(q, s) on k correct processes under a uniform stochastic scheduler, the system latency is O(q + s*sqrt(k)), and the individual latency is O(k(q + s*sqrt(k)))."

More precision about my new inventions of scalable algorithms..

Look at my powerful inventions below, LW_Fast_RWLockX and Fast_RWLockX: two powerful scalable RWLocks that are FIFO fair, starvation-free and costless on the reader side (that means no atomics and no fences on the reader side). They use sys_membarrier expedited on Linux and FlushProcessWriteBuffers() on Windows. If you look at the source code of my LW_Fast_RWLockX.pas and Fast_RWLockX.pas inside the zip file, you will notice that on Linux they call two functions, membarrier1() and membarrier2(): membarrier1() registers the process's intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED, and membarrier2() executes a memory barrier on each running thread belonging to the same process as the calling thread.

Read more here to understand:

https://man7.org/linux/man-pages/man2/membarrier.2.html

Here are my new powerful inventions of scalable algorithms..

I have just updated my powerful inventions LW_Fast_RWLockX and Fast_RWLockX, two powerful scalable RWLocks that are FIFO fair, starvation-free and costless on the reader side (that means no atomics and no fences on the reader side). They use sys_membarrier expedited on Linux and FlushProcessWriteBuffers() on Windows, and now they work on both Linux and Windows. I think my inventions are really smart, since a PhD researcher says the following:

"Until today, there is no known efficient reader-writer lock with starvation-freedom guarantees;"

Read more here:

http://concurrencyfreaks.blogspot.com/2019/04/onefile-and-tail-latency.html

So, contrary to what he says, I think that my scalable reader-writer locks above are efficient, FIFO fair and starvation-free.

LW_Fast_RWLockX is a lightweight scalable Reader-Writer Mutex that uses a technique that looks like Seqlock but without looping on the reader side as Seqlock does, which has permitted the reader side to be costless; it is fair, it is of course starvation-free, and it does spin-wait. Fast_RWLockX is also a lightweight scalable Reader-Writer Mutex that uses the same Seqlock-like technique without looping on the reader side, so its reader side is costless too; it is fair and of course starvation-free, but it does not spin-wait: it waits on my SemaMonitor, so it is energy efficient.

You can read about them and download them from my website here:

https://sites.google.com/site/scalable68/scalable-rwlock

My other inventions are the following scalable RWLocks that are FIFO fair and starvation-free:

Here is my invention of a scalable, starvation-free, FIFO fair and lightweight Multiple-Readers-Exclusive-Writer Lock called LW_RWLockX; it works across processes and threads:

https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

And here are my inventions of new variants of scalable RWLocks that are FIFO fair and starvation-free:

https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

More about the energy efficiency of Transactional memory and more..

I have just read the following PhD paper, which is also about the energy efficiency of Transactional memory:

Techniques for Enhancing the Efficiency of Transactional Memory Systems

http://kth.diva-portal.org/smash/get/diva2:1258335/FULLTEXT02.pdf

I think it is the best known energy-efficient algorithm for Transactional memory, but I think it is not good: look at how, for 64 cores, the Beta parameter can be 16 cores. So I think I am smart, and I have just invented a much more energy-efficient and powerful scalable fast Mutex, and I have also just invented scalable RWLocks that are starvation-free and fair. Read about them in my writing and thoughts below:

More about deadlocks and lock-based systems and more..

I have just read the following from a software engineer from Quebec, Canada:

A deadlock-detecting mutex

https://faouellet.github.io/ddmutex/

I quickly understood his algorithm, but I think it is not efficient at all: we can check whether a graph has a cycle (a nontrivial strongly connected component) in O(V+E) time, so the algorithm of the engineer from Quebec, Canada takes around O(n*(V+E)) time, which is not good.
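The O(V+E) check mentioned above can be sketched in Python as a depth-first search over a wait-for graph, in which an edge T1 -> T2 means "thread T1 waits for a lock held by thread T2". This is the standard three-colour cycle detection, shown only as an illustration; it is neither the ddmutex code nor the DelphiConcurrent code, and the graph representation (a dict of adjacency lists) is an assumption of the sketch.

```python
def has_cycle(graph):
    """Return True if the directed graph contains a cycle.

    graph maps each node to an iterable of successor nodes. Every node
    and every edge is visited once, so the cost is O(V + E).
    """
    WHITE, GREY, BLACK = 0, 1, 2    # unvisited, on the DFS path, finished
    color = {}
    for start in graph:
        if color.get(start, WHITE) != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                state = color.get(nxt, WHITE)
                if state == GREY:
                    return True     # back edge: a cycle, hence a deadlock
                if state == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                color[node] = BLACK  # all successors explored
                stack.pop()
    return False

# Three threads, each waiting for a lock held by the next one.
deadlocked = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}
# The same threads with the last wait removed.
healthy = {"T1": ["T2"], "T2": ["T3"], "T3": []}
```

Here has_cycle(deadlocked) is True and has_cycle(healthy) is False. The search is iterative, so a long chain of waiting threads does not hit the Python recursion limit.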

So a much better way is to use my following way of detecting deadlocks:

DelphiConcurrent and FreepascalConcurrent are here

Read more here in my website:

https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent

I will soon enhance DelphiConcurrent and FreepascalConcurrent much more, to support both communication deadlocks and resource deadlocks.

About Transactional memory and locks..

I have just read the following paper about Transactional memory and locks:

http://sunnydhillon.net/assets/docs/concurrency-tm.pdf

I don't agree with the above paper; read my following thoughts to understand why:

I have just invented a new powerful scalable fast mutex, and it has the following characteristics:

1- Starvation-free

2- Tunable fairness

3- It efficiently keeps its cache-coherence traffic very low

4- Very good fast path performance

5- Good preemption tolerance

6- It is faster than scalable MCS lock

7- It solves the problem of lock convoying

So my new invention also solves the following problem:

The convoy phenomenon

https://blog.acolyer.org/2019/07/01/the-convoy-phenomenon/

And here is my other new invention, a Scalable RWLock that works across processes and threads and that is starvation-free and fair. I will soon enhance it much more and it will become really powerful:

https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

And about Lock-free versus Lock, read my following post:

https://groups.google.com/forum/#!topic/comp.programming.threads/F_cF4ft1Qic

And about deadlocks, here is also how I have solved them; I will soon enhance DelphiConcurrent and FreepascalConcurrent much more:

DelphiConcurrent and FreepascalConcurrent are here

Read more here in my website:

https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent

So I think that with my scalable fast mutex above and my scalable RWLocks that are starvation-free and fair, and by reading the following about the composability of lock-based systems, you will notice that lock-based systems are still useful.

"About composability of lock-based systems..

Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable: “Locks and condition variables do not support modular programming,” reads one typically brazen claim, “building large programs by gluing together smaller programs[:] locks make this impossible.” The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.

There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.

Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be no global subsystem state—subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized."

Read more here:

https://queue.acm.org/detail.cfm?id=1454462

About mathematics and about abstraction..

I think my specialization is also that I have invented many software algorithms and scalable algorithms, and I am still inventing others. Inventing them is like inventing mathematical theorems that you prove and present at a higher level of abstraction. But not only that: my algorithms are presented as higher-level software abstractions that hide their complexity, and this is the part that interests me most. For example, notice how I construct a higher-level abstraction in my following tutorial, as a methodology that first permits modeling the synchronization objects of parallel programs with logic primitives (If-Then-OR-AND), so that they are easy to translate to Petri nets, which in turn makes it possible to detect deadlocks in parallel programs. Please take a look at it at the following web link, because this tutorial of mine is a way of learning by higher-level abstraction:

How to analyse parallel applications with Petri Nets

https://sites.google.com/site/scalable68/how-to-analyse-parallel-applications-with-petri-nets

So notice that my methodology is a generalization that solves both communication deadlocks and resource deadlocks in parallel programs:

1- Communication deadlocks, which result from incorrect use of event objects or condition variables (i.e. wait-notify synchronization).

2- Resource deadlocks, a common kind of deadlock in which a set of threads blocks forever because each thread in the set is waiting to acquire a lock held by another thread in the set.

This is what interests me in mathematics: I want to work efficiently in mathematics at a much higher level of abstraction. Here is an example of what I am doing in mathematics so that you understand: look at how I implement mathematics as software, in parallel conjugate gradient system solvers that scale very well, and how I present them at a higher level of abstraction. This is how I abstract their mathematics; read my thoughts about SOR and the Conjugate Gradient method to notice it:

As you have noticed, I have just written above about my Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well; here are my Parallel Delphi and Freepascal Conjugate Gradient Linear System Solver Libraries, which also scale very well:

Parallel implementation of Conjugate Gradient Dense Linear System solver library that is NUMA-aware and cache-aware that scales very well

https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-dense-linear-system-solver-library-that-is-numa-aware-and-cache-aware

Parallel implementation of Conjugate Gradient Sparse Linear System solver library that scales very well

https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-sparse-linear-system-solver

More of my philosophy about Unix and Linux and more..

I am a white Arab, and I think I am smart since I have also invented many scalable algorithms and algorithms..

I invite you to look at the following interesting video:

Unix vs Linux

https://www.youtube.com/watch?v=jowCUo_UGts

My Diploma is a university-level Diploma. My school in Morocco, where I studied and got my university-level Diploma in Microelectronics and informatics, was under the control of the Académie de Paris in France. Here it is:

https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Ffr.wikipedia.org%2Fwiki%2FAcad%25C3%25A9mie_de_Paris

I started my studies in Microelectronics and informatics in 1986, and during my informatics studies at my university-level school I programmed and worked with a computer called the Altos 586, which was a Unix system. Here it is:

https://en.wikipedia.org/wiki/Altos_586

And I got my university-level Diploma in Microelectronics and informatics in 1989.

So, as you notice, I know how to program on Unix, Linux and Windows too. As a proof, here is my new scalable algorithm invention, which I have also ported to Windows and Linux (you can download the zip file from my website and take a look at its source code):

https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

And you can look at my other open source software projects here on my website:

https://sites.google.com/site/scalable68/

And my new software invention today is the following:

You have to know that a Turing-complete system can be proven mathematically to be capable of performing any possible calculation or computer program, and the bash shell, on Linux and on Windows, is Turing-complete. Even if the bash shell is not Python, it is a minimalist language that is especially designed for administrators of operating systems. But I have noticed that the bash shell is not suited for parallel programming, which is why I am enhancing it with my scalable algorithms to support sophisticated parallel programming on both Linux and Windows, permitting it to scale much better on RAIDs and on multicores. I am also writing a book about my enhancement of the bash shell with my scalable algorithms, to help others be efficient in bash shell programming and in operating system administration, and of course I will sell my book. So I don't think you need Python, since Python doesn't come with my scalable algorithms that will enhance bash for Linux and Windows, and I think operating system administrators don't need Python, since it is not a minimalist language like bash for Linux and Windows and so is not suited for them.

You can read more about bash shell from here:

https://www.infoworld.com/article/2893519/perl-python-ruby-are-nice-bash-is-where-its-at.html

More philosophy about my kind of interests and more about me..

More philosophy about what kind of friends I have..

I look like a PhD researcher since I am an inventor and I have invented many scalable algorithms and algorithms, and I am still inventing algorithms. This is why my friends are like PhD researchers. Here is one of my friends, a PhD researcher and Full Professor; he is one of my best friends, and I have known him for around 23 years:

https://www.usherbrooke.ca/gelecinfo/fr/departement/profs/khoa-fr/khoa-en/

Notice carefully on his webpage that he is a full professor who teaches a course in operational research (which uses sophisticated mathematics). It is called:

"Performance analysis, probability and queuing, GIF 360"

I have discussed operational research with him a lot, since I have also studied operational research, and here are some of my operational research software projects:

About capacity planning and queuing theory..

I am a white Arab and I think I am smart, since I have also invented many scalable algorithms and algorithms. I have a university-level Diploma in Microelectronics and informatics and I am a software developer, but I have also studied operational research..

Read my following thoughts about operational research and some of my software projects in operational research:

I have bought the following book, Performance by Design: Computer Capacity Planning By Example:

https://www.amazon.ca/Performance-Design-Computer-Capacity-Planning/dp/0130906735

The book analyzes the performance of an E-Business service with queuing theory, but I think its methodology is error-prone because it involves many mathematical calculations. This is why I have decided to construct another methodology that is much less error-prone and easier, and that uses the Jackson network. My methodology works when 65% or more of the database transactions are reads and the total of write and delete transactions is 25% or less, and it also works when 65% or more of the database transactions are writes. So I think my methodology is suitable for capacity planning of E-Business services with mathematical queuing theory. I will write a book that explains my methodology, and of course I am taking the HTTP or HTTPS overhead into account, and I will provide you with a program too.

And here is my PDQ for Delphi and FreePascal

This is a port by Amine Moulay Ramdane of PDQ version 6.2.0 to Delphi on Windows and to FreePascal on both Windows and Linux. I have also provided two demos, an M/M/1 queuing demo and a Jackson network demo, as well as my HTML tutorial on how to solve analytically the Jackson network problem that is provided as a PDQ demo.

You can download it from my website here:

https://sites.google.com/site/scalable68/pdq-for-delphi-and-freepascal

PDQ is an analytic queueing-circuit analyzer made freely available under the MIT/X11 license from www.perfdynamics.com

Read more about PDQ here:

http://www.perfdynamics.com/Tools/PDQ.html
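To give an idea of what solving an M/M/1 queue and an open Jackson network analytically involves, here is a minimal Python sketch. It is independent of PDQ and of my Pascal port: the function names `mm1` and `jackson`, the two-node web server and database layout, and all the rates are hypothetical values chosen only for illustration.

```python
def mm1(lam, mu):
    """Return (utilization, mean jobs in system, mean response time) for M/M/1."""
    if lam >= mu:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = lam / mu
    return rho, rho / (1.0 - rho), 1.0 / (mu - lam)


def jackson(gamma, routing, mu, iterations=1000):
    """Solve an open Jackson network of single-server nodes.

    gamma[i]      external arrival rate into node i
    routing[i][j] probability that a job leaving node i goes to node j
    mu[i]         service rate of node i
    """
    n = len(gamma)
    lam = list(gamma)
    # Traffic equations: lam[j] = gamma[j] + sum_i lam[i] * routing[i][j],
    # solved here by fixed-point iteration (converges when jobs eventually leave).
    for _ in range(iterations):
        lam = [gamma[j] + sum(lam[i] * routing[i][j] for i in range(n))
               for j in range(n)]
    # Jackson's theorem: each node then behaves like an independent M/M/1 queue.
    nodes = [mm1(lam[i], mu[i]) for i in range(n)]
    jobs = sum(node[1] for node in nodes)
    # Little's law over the whole network gives the end-to-end response time.
    return lam, jobs, jobs / sum(gamma)


# Hypothetical example: node 0 = web server, node 1 = database.
# Half of the jobs leaving the web server visit the database, and half of
# the jobs leaving the database go back to the web server.
lam, jobs, response = jackson(
    gamma=[10.0, 0.0],
    routing=[[0.0, 0.5], [0.5, 0.0]],
    mu=[20.0, 10.0],
)
print(lam, jobs, response)
```

With these assumed rates both nodes end up about two thirds utilized, so the same few lines can be rerun with a higher arrival rate to see where the network saturates.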

I have also implemented an M/M/n queuing model simulation in Object Pascal, here it is:

https://sites.google.com/site/scalable68/m-m-n-queuing-model-simulation-with-object-pascal
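A simulation of an M/M/n queue can be checked against the closed-form Erlang C results. The following Python sketch computes them; it is not taken from my Object Pascal program, and the function name `mmn` and the example rates are assumptions made for illustration.

```python
from math import factorial


def mmn(lam, mu, n):
    """Return (utilization, P(wait), mean queueing time, mean response time)
    for an M/M/n queue with arrival rate lam and per-server service rate mu."""
    a = lam / mu          # offered load in Erlangs
    rho = a / n           # utilization of each server
    if rho >= 1.0:
        raise ValueError("unstable queue: lam must be below n * mu")
    # Erlang C: probability that an arriving job finds all n servers busy.
    tail = a ** n / factorial(n) / (1.0 - rho)
    p_wait = tail / (sum(a ** k / factorial(k) for k in range(n)) + tail)
    wq = p_wait / (n * mu - lam)   # mean time spent waiting in the queue
    return rho, p_wait, wq, wq + 1.0 / mu


# Hypothetical example: 2 servers, 1 job per second arriving,
# each server completing 1 job per second.
print(mmn(1.0, 1.0, 2))
```

With n = 1 the formulas reduce to the M/M/1 case, which is a convenient sanity check before comparing against simulated averages.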

I have also implemented a Maxflow algorithm for Delphi and FreePascal, here it is:

https://sites.google.com/site/scalable68/maxflow-algorithm-for-delphi-and-freepascal

More philosophy about AMD Ryzen Threadripper PRO and Nvidia V100 PCIe (Volta)..

And I invite you to look at the following spec of the AMD Ryzen Threadripper PRO 3975WX 32-Core CPU that I will buy next month:

https://www.techpowerup.com/cpu-specs/ryzen-threadripper-pro-3975wx.c2315

And look carefully at the following benchmark:

https://www.xcelerit.com/computing-benchmarks/insights/benchmarks-intel-xeon-scalable-processor-vs-nvidia-v100-gpu/

As you can see, the Nvidia V100 PCIe (Volta) 16 GB is rated at 7,014 GFLOPS (double precision), and the AMD Ryzen Threadripper PRO 3975WX 32-Core CPU at around 6,451.2 GFLOPS. But look carefully at the price of the Nvidia V100 PCIe (Volta) 16 GB, which is 7124 US dollars:

https://www.amazon.ca/PNY-TCSV100MPCIE-PB-Nvidia-Tesla-v100/dp/B076P84525/ref=pd_di_sccai_3?pd_rd_w=AmXIj&pf_rd_p=e92f388e-b766-4f7f-aac1-ee1d0056e8fb&pf_rd_r=77B7DWXEVBM5VSXT4NZG&pd_rd_r=27e26b6a-c0a6-4558-8a68-97e286ba6213&pd_rd_wg=HxaUi&pd_rd_i=B076P84525&psc=1

And look at the price of the AMD Ryzen Threadripper PRO 3975WX 32-Core CPU, which is 2790 US dollars:

https://www.newegg.ca/amd-ryzen-threadripper-pro-3975wx/p/N82E16819113677

So I think that the AMD Ryzen Threadripper PRO 3975WX 32-Core CPU is competitive with the Nvidia V100 PCIe (Volta) 16 GB in GFLOPS per dollar.
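The comparison can be made explicit with a small calculation. This Python sketch only divides the GFLOPS and price figures quoted above; the helper name `gflops_per_dollar` is hypothetical, and the result is only meaningful if both GFLOPS figures refer to the same floating-point precision.

```python
def gflops_per_dollar(gflops, price_usd):
    """Peak GFLOPS obtained per US dollar of purchase price."""
    return gflops / price_usd


# Figures as quoted in this post.
v100 = gflops_per_dollar(7014.0, 7124.0)
threadripper = gflops_per_dollar(6451.2, 2790.0)

print(f"Nvidia V100 PCIe 16 GB:  {v100:.2f} GFLOPS per dollar")
print(f"Threadripper PRO 3975WX: {threadripper:.2f} GFLOPS per dollar")
print(f"Ratio: {threadripper / v100:.2f}")
```

With the quoted figures this gives about 0.98 GFLOPS per dollar for the V100 and about 2.31 for the Threadripper, a ratio of roughly 2.3.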

I have just read the following interesting article about AVX512:

On the dangers of Intel's frequency scaling

https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

As you will notice by reading the above article, you should avoid AVX512: it heats the CPU cores a lot, so Intel greatly reduces the clock speed of the cores when it is used, and this is not good for performance. So what I advise is to avoid AVX2 and AVX512 and to use AVX, which does not have this problem. And the AMD Ryzen Threadripper PRO 3975WX 32-Core that I will buy next month also supports AVX2.

More about me and about fault-tolerant computer systems and more..

I am a white Arab, and I think I am smart since I have also invented many scalable algorithms and algorithms..

I came to Canada when I was 20 years old, I have been living in Quebec, Canada for 32 years, and I am now 52 years old. But I am genetically an athletic guy and I feel that I am still young, because I am athletic and 6 feet tall. I am beautiful from the inside, since I am a gentleman type of person, and that is also genetic in me. I have worked as a software consultant with some hospitals in the USA, and I have worked with some computer hardware companies and software companies in British Columbia and in New Brunswick in Canada. Here is more about my education, my Diploma and more:

My name is Amine Moulay Ramdane. I am a white Arab from Morocco, and I think I am smart since I have also invented many scalable algorithms and algorithms. I am a gentleman type of person, and I have lived in Quebec, Canada since 1989; I am also a Canadian from Morocco. You have seen me writing my thoughts on my political philosophy here, and now I will talk about my education and my Diploma. My Diploma is a university-level Diploma: the school in Morocco where I studied and got my university-level Diploma in Microelectronics and informatics was under the control of the Paris Academy in France (we call it Académie de Paris), and here it is:

https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Ffr.wikipedia.org%2Fwiki%2FAcad%25C3%25A9mie_de_Paris

I then studied applied mathematics for one more year at the University of Montreal in Quebec, Canada, and I passed that year, so with my Diploma and this one year of applied mathematics I have completed 3 years of university-level studies. After that I studied network administration, and I have also worked as a network administrator and as a software developer consultant; the name of my company was and is Cyber-NT Communications in Quebec, Canada. Around 2001 and 2002 I started to implement some of my software, such as PerlZip, which resembled PKZIP from the PKWARE software company, but which I implemented for Perl. I implemented the Dynamic Link Libraries of PerlZip, which compress and decompress etc., with the Delphi compiler, so my PerlZip software product was very fast and very efficient. In 2002 I posted the Beta version on the internet; as a proof, please read about it here:

http://computer-programming-forum.com/52-perl-modules/ea157f4a229fc720.htm

After that I sold the release version of my PerlZip product to many companies and to many individuals around the world, and I even sold it to many banks in Europe, and with that I made more money.

After that I continued to work as a software developer consultant and network administrator. Here is my company in Quebec (Canada), called Cyber-NT Communications; read the proof here:

https://opencorporates.com/companies/ca_qc/2246777231

Also read the following part of a somewhat old O'Reilly book called Perl for System Administration by David N. Blank-Edelman, and you will notice that it contains my name and speaks about some of my Perl modules:

https://www.oreilly.com/library/view/perl-for-system/1565926099/ch04s04.html

And you can find my open source software projects on my website here:

https://sites.google.com/site/scalable68/

More philosophy about HP NonStop to x86 Server Platform fault-tolerant computer systems and more..

I am a white Arab and I think I am smart, since I have also invented many scalable algorithms and algorithms..

Now HP to Extend HP NonStop to x86 Server Platform

HP announced in 2013 plans to extend its mission-critical HP NonStop technology to x86 server architecture, providing the 24/7 availability required in an always-on, globally connected world, and increasing customer choice.

Read about it here:

https://www8.hp.com/us/en/hp-news/press-release.html?id=1519347#.YHSXT-hKiM8

And today HP provides HP NonStop on the x86 server platform; here is an example:

https://www.hpe.com/ca/en/pdfViewer.html?docId=4aa5-7443&parentPage=/ca/en/products/servers/mission-critical-servers/integrity-nonstop-systems&resourceTitle=HPE+NonStop+X+NS7+%E2%80%93+Redefining+continuous+availability+and+scalability+for+x86+data+sheet

So I think that programming HP NonStop for x86 is compatible with x86 CPU architecture programming, so my following methodology works correctly. Read it carefully, since I have just extended my thoughts:

Here is my next powerful computer..

Next month I will buy a powerful computer with the following powerful CPU:

AMD Ryzen Threadripper PRO 3975WX 32-Core 3.5 GHz

https://www.newegg.ca/amd-ryzen-threadripper-pro-3975wx/p/N82E16819113677

So the computer that I will buy next month will cost me around 9 thousand dollars. I want to do some testing with the above CPU, which comes with 32 cores and 8 memory channels, since I have invented many scalable algorithms and algorithms, I am writing two books about parallelism and concurrency that I will sell, and I have invented some powerful tools for parallelism and concurrency that I will sell too, etc.

So as you are noticing, I am also buying a 3,499 US dollar CPU from the USA to make the USA economy work better.

Here are some benchmarks of a less powerful AMD CPU, the Threadripper 3970X with 4 memory channels:

https://www.pugetsystems.com/labs/hpc/HPC-Parallel-Performance-for-3rd-gen-Threadripper-Xeon-3265W-and-EPYC-7742-HPL-HPCG-Numpy-NAMD-1717/

Also, my next AMD Ryzen Threadripper PRO 3975WX 32-Core 3.5 GHz can be configured to work as 4 NUMA nodes, and access to far memory will then be 1.6x slower than access to near memory. But my scalable algorithms, such as my scalable MLock, will still work correctly, since what is important is scalability, even if far memory access is 1.6x slower than near memory access on this CPU.

About smartness and about MCS Lock and more..

I have just read the following article from ACM:

Scalability Techniques for Practical Synchronization Primitives

https://queue.acm.org/detail.cfm?id=2698990

Notice how they speak about one of the best scalable locks, which we call the MCS lock. But I think that the CLH and MCS locks are not smart, since those scalable locks are intrusive: the parameter they require has to be passed by the caller or hidden. This is why I think I am smart, since I have invented a scalable lock that is better than the MCS lock: my scalable lock doesn't require any parameter to be passed, you just call the Enter() and Leave() methods and that's all. Read carefully about it on my website here:

https://sites.google.com/site/scalable68/scalable-mlock

I have also just enhanced it further and I will post it soon.
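To make the point about the parameter concrete, here is a minimal Python sketch of a FIFO queue lock in the style of MCS, where the per-thread queue node is kept in thread-local storage so that the caller only sees enter() and leave(). It is not my scalable MLock algorithm: the class name `QueueLock` is hypothetical, and an ordinary `threading.Lock` stands in for the atomic swap that a native MCS lock would use, so it shows the interface only, not the scalability.

```python
import threading


class QueueLock:
    """FIFO queue lock whose per-thread queue node is hidden from the caller."""

    class _Node:
        def __init__(self):
            self.granted = threading.Event()  # the owning thread waits on its own node
            self.next = None

    def __init__(self):
        self._tail = None
        # Stand-in for the atomic swap / compare-and-swap of a native MCS lock.
        self._guard = threading.Lock()
        # Each thread finds its own node here, so enter()/leave() take no parameter.
        self._local = threading.local()

    def enter(self):
        node = self._Node()
        self._local.node = node
        with self._guard:
            # Append our node to the queue and link it behind the previous tail.
            pred, self._tail = self._tail, node
            if pred is not None:
                pred.next = node
        if pred is not None:
            node.granted.wait()  # block until the predecessor hands the lock over

    def leave(self):
        node = self._local.node
        with self._guard:
            if self._tail is node:  # nobody is queued behind us
                self._tail = None
                return
            successor = node.next
        successor.granted.set()  # pass ownership to the next thread in FIFO order
```

A thread simply calls `lock.enter()` before its critical section and `lock.leave()` after it. The lock is not reentrant, and in a native implementation each waiter would spin on its own node instead of blocking on an event.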

I have also invented many other scalable algorithms and algorithms..

Here are some of them:

https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references

https://sites.google.com/site/scalable68/scalable-rwlock

https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

https://groups.google.com/forum/#!topic/comp.programming.threads/VaOo1WVACgs

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well

Thank you,

Amine Moulay Ramdane.
