
More precision about my new inventions of scalable algorithms..


Amine Moulay Ramdane

Feb 18, 2021, 2:07:15 PM
Hello..


More precision about my new inventions of scalable algorithms..

My LW_Fast_RWLockX and Fast_RWLockX below are two powerful scalable RWLocks that are FIFO fair,
starvation-free, and costless on the reader side
(that is, with no atomics and no fences on the reader side). They use expedited sys_membarrier on Linux and FlushProcessWriteBuffers() on Windows, and if you look at the source code of my LW_Fast_RWLockX.pas
and Fast_RWLockX.pas inside the zip file, you will notice that on Linux they call two functions, membarrier1() and membarrier2(): membarrier1() registers the process's intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED, and membarrier2() executes a memory barrier on each running thread belonging to the same process as the calling thread.

Read more here to understand:

https://man7.org/linux/man-pages/man2/membarrier.2.html
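
To make those two calls concrete, here is a minimal C++ sketch of what membarrier1() and membarrier2() amount to on Linux. This is not the Pascal source from the zip file; the function names simply follow the description above, and the raw syscall is used because glibc provides no membarrier() wrapper on many systems:

__________________________________
// Hypothetical sketch of the two Linux membarrier calls described above
// (not the author's LW_Fast_RWLockX.pas code).
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static long sys_membarrier(int cmd, unsigned int flags)
{
    return syscall(__NR_membarrier, cmd, flags);
}

// membarrier1(): register the process's intent to use the
// private expedited command (must be done once before use).
bool membarrier1()
{
    return sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) == 0;
}

// membarrier2(): execute a memory barrier on every running thread
// belonging to the same process as the calling thread.
bool membarrier2()
{
    return sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) == 0;
}
__________________________________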

Here are my new powerful inventions of scalable algorithms..

I have just updated LW_Fast_RWLockX and Fast_RWLockX, my two powerful scalable RWLocks that are FIFO fair,
starvation-free, and costless on the reader side (that means with no atomics and with no fences on the reader side). They use expedited sys_membarrier on Linux and FlushProcessWriteBuffers() on Windows, and they now work on both Linux and Windows. And I think my inventions are really smart, since the following PhD researcher
says the following:

"Until today, there is no known efficient reader-writer lock with starvation-freedom guarantees;"

Read more here:

http://concurrencyfreaks.blogspot.com/2019/04/onefile-and-tail-latency.html

So as you have just noticed he says the following:

"Until today, there is no known efficient reader-writer lock with starvation-freedom guarantees;"

So I think that my above powerful inventions of scalable reader-writer locks are efficient, FIFO fair, and starvation-free.

LW_Fast_RWLockX is a lightweight scalable reader-writer mutex that uses a technique that looks like Seqlock, but without looping on the reader side as Seqlock does, and this is what permits the reader side to be costless; it is fair, it is of course starvation-free, and it does spin-wait. Fast_RWLockX is a lightweight scalable reader-writer mutex that uses the same Seqlock-like technique without looping on the reader side, so its reader side is also costless; it is fair and of course starvation-free, but it does not spin-wait: it waits on my SemaMonitor, so it is energy efficient.

You can read about them and download them from my website here:

https://sites.google.com/site/scalable68/scalable-rwlock

Also my other inventions are the following scalable RWLocks that are
FIFO fair and starvation-free:

Here is my invention of a scalable, starvation-free, FIFO fair, and lightweight Multiple-Readers-Exclusive-Writer Lock called LW_RWLockX; it works across processes and threads:

https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

And here are my inventions of new variants of scalable RWLocks that are FIFO fair and starvation-free:

https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

More about the energy efficiency of Transactional memory and more..

I have just read the following PhD thesis, which is also about the energy efficiency of transactional memory; here it is:

Techniques for Enhancing the Efficiency of Transactional Memory Systems

http://kth.diva-portal.org/smash/get/diva2:1258335/FULLTEXT02.pdf

And I think it presents the best known energy-efficient algorithm for
transactional memory, but I think it is not good, since
look at how, for 64 cores, the Beta parameter can be 16 cores.
So I think I am smart, and I have just invented a much more energy-efficient and powerful scalable fast mutex, and I have also just invented scalable RWLocks that are starvation-free and fair; read about them in my writing and thoughts below:

More about deadlocks and lock-based systems and more..

I have just read the following from a software engineer from Quebec, Canada:

A deadlock-detecting mutex

https://faouellet.github.io/ddmutex/

And I have quickly understood his algorithm, but I think
his algorithm is not efficient at all, since we can detect
whether a graph contains a cycle (a strongly connected component) in about O(V+E) time, so the algorithm above of the engineer from Quebec, Canada takes about O(n*(V+E)) time, so it is not good.
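
For reference, here is a minimal, illustrative C++ sketch (not the DelphiConcurrent source) of the O(V+E) check mentioned above: one depth-first search over a wait-for graph, where an edge u -> v means "u waits for v", and any back edge found along the current path closes a cycle, i.e. a potential deadlock:

__________________________________
#include <cstddef>
#include <cstdio>
#include <vector>

enum Color { WHITE, GREY, BLACK };

// Returns true if a cycle is reachable from node u.
static bool dfs_has_cycle(int u, const std::vector<std::vector<int>>& adj,
                          std::vector<Color>& color)
{
    color[u] = GREY;                        // u is on the current DFS path
    for (int v : adj[u])
    {
        if (color[v] == GREY) return true;  // back edge -> cycle -> deadlock
        if (color[v] == WHITE && dfs_has_cycle(v, adj, color)) return true;
    }
    color[u] = BLACK;                       // fully explored, no cycle through u
    return false;
}

// O(V+E): every node and every edge is visited at most once.
static bool has_deadlock(const std::vector<std::vector<int>>& waits_for)
{
    std::vector<Color> color(waits_for.size(), WHITE);
    for (std::size_t u = 0; u < waits_for.size(); ++u)
        if (color[u] == WHITE &&
            dfs_has_cycle(static_cast<int>(u), waits_for, color))
            return true;
    return false;
}

int main()
{
    // Thread 0 waits for thread 1 and thread 1 waits for thread 0: a deadlock.
    std::vector<std::vector<int>> waits_for = { {1}, {0} };
    std::printf("deadlock: %s\n", has_deadlock(waits_for) ? "yes" : "no");
}
__________________________________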

So a much better way is to use my following way of detecting deadlocks:

DelphiConcurrent and FreepascalConcurrent are here

Read more here in my website:

https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent

And I will soon enhance DelphiConcurrent and FreepascalConcurrent much more to support both communication deadlocks
and resource deadlocks.

About Transactional memory and locks..


I have just read the following paper about Transactional memory and locks:

http://sunnydhillon.net/assets/docs/concurrency-tm.pdf


I don't agree with the above paper; read my following thoughts
to understand why:

I have just invented a new powerful scalable fast mutex, and it has the following characteristics:

1- Starvation-free
2- Tunable fairness
3- It keeps its cache-coherence traffic very low
4- Very good fast-path performance
5- Good preemption tolerance
6- It is faster than the scalable MCS lock
7- It solves the problem of lock convoying

So my new invention also solves the following problem:

The convoy phenomenon

https://blog.acolyer.org/2019/07/01/the-convoy-phenomenon/


And here is my other new invention, a scalable RWLock that works across processes and threads and that is starvation-free and fair; I will soon enhance it much more and it will become really powerful:

https://sites.google.com/site/scalable68/scalable-rwlock-that-works-accross-processes-and-threads

And about Lock-free versus Lock, read my following post:

https://groups.google.com/forum/#!topic/comp.programming.threads/F_cF4ft1Qic

And about deadlocks, here is also how I have solved them, and I will soon enhance DelphiConcurrent and FreepascalConcurrent much more:

DelphiConcurrent and FreepascalConcurrent are here

Read more here in my website:

https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent


So I think that with my above scalable fast mutex and my scalable RWLocks
that are starvation-free and fair, and by reading the following about the composability of lock-based systems, you will notice that lock-based systems are still useful.


"About composability of lock-based systems..


Design your systems to be composable. Among the more galling claims of
the detractors of lock-based systems is the notion that they are somehow
uncomposable: “Locks and condition variables do not support modular
programming,” reads one typically brazen claim, “building large programs
by gluing together smaller programs[:] locks make this impossible.”9 The
claim, of course, is incorrect. For evidence one need only point at the
composition of lock-based systems such as databases and operating
systems into larger systems that remain entirely unaware of lower-level
locking.

There are two ways to make lock-based systems completely composable, and
each has its own place. First (and most obviously), one can make locking
entirely internal to the subsystem. For example, in concurrent operating
systems, control never returns to user level with in-kernel locks held;
the locks used to implement the system itself are entirely behind the
system call interface that constitutes the interface to the system. More
generally, this model can work whenever a crisp interface exists between
software components: as long as control flow is never returned to the
caller with locks held, the subsystem will remain composable.

Second (and perhaps counterintuitively), one can achieve concurrency and
composability by having no locks whatsoever. In this case, there must be
no global subsystem state—subsystem state must be captured in
per-instance state, and it must be up to consumers of the subsystem to
assure that they do not access their instance in parallel. By leaving
locking up to the client of the subsystem, the subsystem itself can be
used concurrently by different subsystems and in different contexts. A
concrete example of this is the AVL tree implementation used extensively
in the Solaris kernel. As with any balanced binary tree, the
implementation is sufficiently complex to merit componentization, but by
not having any global state, the implementation may be used concurrently
by disjoint subsystems—the only constraint is that manipulation of a
single AVL tree instance must be serialized."

Read more here:

https://queue.acm.org/detail.cfm?id=1454462


About mathematics and about abstraction..


I think my specialization is also that I have invented many algorithms and scalable algorithms, and I am still inventing more. The scalable algorithms and algorithms that I have invented are like mathematical theorems that you prove and present at a higher level of abstraction, but not only that: my algorithms are also presented in the form of a higher-level software abstraction that hides their complexity, and this is the part that interests me most. For example, notice how I construct higher-level abstraction in my following tutorial, a methodology that first models the synchronization objects of parallel programs with If-Then-OR-AND logic primitives so that they are easy to translate to Petri nets, which are then used to detect deadlocks in parallel programs. Please take a look at it at my following web link, because this tutorial of mine teaches by higher-level abstraction:


How to analyse parallel applications with Petri Nets


https://sites.google.com/site/scalable68/how-to-analyse-parallel-applications-with-petri-nets

So notice that my methodology is a generalization that handles both communication deadlocks and resource deadlocks in parallel programs:

1- Communication deadlocks that result from incorrect use of
event objects or condition variables (i.e. wait-notify
synchronization).


2- Resource deadlocks, a common kind of deadlock in which a set of
threads blocks forever because each thread in the set is waiting to
acquire a lock held by another thread in the set (see the sketch just
below).
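
To make item 2 concrete, here is a textbook C++ illustration (hypothetical code, not taken from DelphiConcurrent or FreepascalConcurrent) of a resource deadlock: two threads take the same two mutexes in opposite order, so each may end up waiting forever for the lock the other one holds. A wait-for-graph detector flags the cycle thread1 -> B -> thread2 -> A -> thread1; acquiring the locks in a consistent order (or with std::lock(A, B)) avoids it:

__________________________________
#include <mutex>
#include <thread>

std::mutex A, B;

void thread1()
{
    std::lock_guard<std::mutex> la(A);  // holds A...
    std::lock_guard<std::mutex> lb(B);  // ...and waits for B, held by thread2
}

void thread2()
{
    std::lock_guard<std::mutex> lb(B);  // holds B...
    std::lock_guard<std::mutex> la(A);  // ...and waits for A, held by thread1
}

int main()
{
    std::thread t1(thread1), t2(thread2);
    t1.join();  // depending on the interleaving, this program may hang forever
    t2.join();
}
__________________________________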


This is what interests me in mathematics: I want to work efficiently at a much higher level of abstraction. To give you
an example of what I am doing in mathematics so that you understand,
look at how I am implementing mathematics as parallel conjugate gradient linear system solvers that scale very well, and presenting them at a higher level of abstraction; this is how I abstract the mathematics behind them. Read the following about it to notice it:

About SOR and Conjugate gradient mathematical methods..

I have just looked at SOR (the Successive Over-Relaxation method),
and I think it is much less powerful than the conjugate gradient method;
read the following to notice it:

COMPARATIVE PERFORMANCE OF THE CONJUGATE GRADIENT AND SOR METHODS
FOR COMPUTATIONAL THERMAL HYDRAULICS

https://inis.iaea.org/collection/NCLCollectionStore/_Public/19/055/19055644.pdf?r=1&r=1


This is why I have implemented, in both C++ and Delphi, my Parallel Conjugate Gradient Linear System Solver Library that scales very well; read my following thoughts about it to understand more:


About the convergence properties of the conjugate gradient method

The conjugate gradient method can theoretically be viewed as a direct method, as it produces the exact solution after a finite number of iterations, which is not larger than the size of the matrix, in the absence of round-off error. However, the conjugate gradient method is unstable with respect to even small perturbations, e.g., most directions are not in practice conjugate, and the exact solution is never obtained. Fortunately, the conjugate gradient method can be used as an iterative method, as it provides monotonically improving approximations to the exact solution, which may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. The improvement is typically linear and its speed is determined by the condition number κ(A) of the system matrix A: the larger κ(A) is, the slower the improvement.

Read more here:

http://pages.stat.wisc.edu/~wahba/stat860public/pdf1/cj.pdf
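
For reference, the standard textbook bound that quantifies that statement (stated here for completeness, in LaTeX): measured in the A-norm, the CG iterates satisfy

\|x_k - x_*\|_A \le 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^{k} \|x_0 - x_*\|_A

so the smaller the condition number κ(A), the faster the error shrinks per iteration.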


So I think my Conjugate Gradient Linear System Solver Library
that scales very well is still very useful; read about it
in my writing below:

Read the following interesting news:

The finite element method finds its place in games

Read more here:

https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Fhpc.developpez.com%2Factu%2F288260%2FLa-methode-des-elements-finis-trouve-sa-place-dans-les-jeux-AMD-propose-la-bibliotheque-FEMFX-pour-une-simulation-en-temps-reel-des-deformations%2F

But you have to be aware that the finite element method uses the conjugate gradient method for the solution of finite element problems; read here to notice it:

Conjugate Gradient Method for Solution of Large Finite Element Problems on CPU and GPU

https://pdfs.semanticscholar.org/1f4c/f080ee622aa02623b35eda947fbc169b199d.pdf


This is why I have also designed and implemented my Parallel Conjugate Gradient Linear System Solver Library that scales very well;
here it is:

My Parallel C++ Conjugate Gradient Linear System Solver Library
that scales very well version 1.76 is here..

Author: Amine Moulay Ramdane

Description:

This library contains a parallel conjugate gradient dense linear system solver that is NUMA-aware and cache-aware and scales very well, and it also contains a parallel conjugate gradient sparse linear system solver that is cache-aware and scales very well.

Sparse linear system solvers are ubiquitous in high performance computing (HPC) and are often the most computationally intensive parts of scientific computing codes. A few of the many applications relying on sparse linear solvers include fusion energy simulation, space weather simulation, climate and environmental modeling, the finite element method, and large-scale reservoir simulations to enhance oil recovery in the oil and gas industry.

Conjugate Gradient is known to converge to the exact solution in n steps for a matrix of size n, and was historically first seen as a direct method because of this. However, after a while people figured out that it works really well if you just stop the iteration much earlier - often you will get a very good approximation after far fewer than n steps. In fact, we can analyze how fast Conjugate Gradient converges. The end result is that Conjugate Gradient is used as an iterative method for large linear systems today.
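
To make the iteration described above concrete, here is a minimal sequential dense CG sketch in C++ (an illustrative toy, not the parallel NUMA-aware library itself):

__________________________________
#include <cmath>
#include <cstdio>
#include <vector>

// Solve A*x = b for a symmetric positive definite matrix A (row-major, n*n),
// stopping when the residual norm falls below tol or after max_iter steps.
std::vector<double> conjugate_gradient(const std::vector<double>& A,
                                       const std::vector<double>& b,
                                       double tol = 1e-10, int max_iter = 1000)
{
    const std::size_t n = b.size();
    std::vector<double> x(n, 0.0), r = b, p = b, Ap(n);
    double rs_old = 0.0;
    for (std::size_t i = 0; i < n; ++i) rs_old += r[i] * r[i];

    for (int k = 0; k < max_iter && std::sqrt(rs_old) > tol; ++k)
    {
        // Ap = A * p
        for (std::size_t i = 0; i < n; ++i)
        {
            Ap[i] = 0.0;
            for (std::size_t j = 0; j < n; ++j) Ap[i] += A[i * n + j] * p[j];
        }
        double pAp = 0.0;
        for (std::size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];
        const double alpha = rs_old / pAp;      // step length along direction p

        double rs_new = 0.0;
        for (std::size_t i = 0; i < n; ++i)
        {
            x[i] += alpha * p[i];               // improve the approximation
            r[i] -= alpha * Ap[i];              // update the residual
            rs_new += r[i] * r[i];
        }
        const double beta = rs_new / rs_old;    // keep the next direction conjugate
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rs_old = rs_new;
    }
    return x;
}

int main()
{
    // Classic 2x2 SPD example: A = [[4,1],[1,3]], b = [1,2];
    // the exact solution is approximately (0.0909, 0.6364).
    std::vector<double> A = { 4.0, 1.0, 1.0, 3.0 };
    std::vector<double> b = { 1.0, 2.0 };
    std::vector<double> x = conjugate_gradient(A, b);
    std::printf("x = (%f, %f)\n", x[0], x[1]);
}
__________________________________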

Please download the zip file and read the readme file inside the zip to know how to use it.

You can download it from:

https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

Language: GNU C++ and Visual C++ and C++Builder

Operating Systems: Windows, Linux, Unix and Mac OS X on (x86)

--

As you have noticed, I have just written above about my Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well, but here are my parallel Delphi and Freepascal Conjugate Gradient Linear System Solver Libraries that scale very well:

Parallel implementation of Conjugate Gradient Dense Linear System solver library that is NUMA-aware and cache-aware that scales very well

https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-dense-linear-system-solver-library-that-is-numa-aware-and-cache-aware

Parallel implementation of Conjugate Gradient Sparse Linear System solver library that scales very well

https://sites.google.com/site/scalable68/scalable-parallel-implementation-of-conjugate-gradient-sparse-linear-system-solver


Thank you,
Amine Moulay Ramdane.











Bonita Montero

Feb 19, 2021, 5:27:24 PM
> My below powerful inventions of LW_Fast_RWLockX and Fast_RWLockX that are two powerful scalable RWLocks that are FIFO fair

RW-locks are never fair.

Chris M. Thomasson

Feb 19, 2021, 5:48:18 PM
There are implementations that can be starvation free.

Bonita Montero

Feb 19, 2021, 6:01:33 PM
>>> My below powerful inventions of LW_Fast_RWLockX and Fast_RWLockX that
>>> are two powerful scalable RWLocks that are FIFO fair

>> RW-locks are never fair.

> There are implementations that can be starvation free.

No, impossible. Readers can hold the lock as long as they
want and starve out writers. And writers could do the same.

Amine Moulay Ramdane

Feb 19, 2021, 6:14:03 PM
You are right, Chris M. Thomasson, and my inventions of scalable RWLocks are fair and starvation-free;
just look at their source code.

Chris M. Thomasson

Feb 19, 2021, 6:17:06 PM
Ummm... I am not sure what you mean here. The type of starvation I am
talking about is when reader activity is so frequent that writers just
can't seem to get in. Even if the reader critical sections are short.
Every time a writer wants to get in, several readers jump in line. It's
all about reader or writer priority; a lot of rwmutex impls have that
flaw. Sure, a reader can hold the lock for an hour, okay. Well, writers
cannot get in for an hour. That's perfectly normal. This is not the same
type of starvation I am referring to...

Chris M. Thomasson

Feb 19, 2021, 6:32:26 PM
On 2/19/2021 3:01 PM, Bonita Montero wrote:
Check this out:

https://vorbrodt.blog/2019/02/14/read-write-mutex/

An older rwmutex I invented. It's simple, and it is starvation-free wrt
reads and writes. Here is a simple Relacy test unit:

__________________________________
// Chris M. Thomasson's RW Mutex


//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
#include <relacy/relacy_std.hpp>
#include <cstdio>
#include <cstddef>
#include <climits> // for INT_MAX / LONG_MAX used below




#if ! defined (NDEBUG)
# define DBG_PRINTF(e) std::printf e
#else
# define DBG_PRINTF(e) ((void)0)
#endif



struct semaphore
{
    HANDLE m_waitset;

    semaphore(LONG count)
    {
        m_waitset = CreateSemaphore(nullptr, count, LONG_MAX, nullptr);
    }

    ~semaphore()
    {
        CloseHandle(m_waitset);
    }

    void post(LONG count = 1)
    {
        ReleaseSemaphore(m_waitset, count, nullptr);
    }

    void wait()
    {
        WaitForSingleObject(m_waitset, INFINITE);
    }
};


class ct_rw_fast_mutex
{
public:
    ct_rw_fast_mutex()
        : m_wrstate(1), m_count(INT_MAX), m_rdwake(0),
          m_rdwset(0), m_wrwset(0), m_wrmtx(0) {}

    void read_lock()
    {
        if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
            m_rdwset.wait();
    }

    void read_unlock()
    {
        if (m_count.fetch_add(1, std::memory_order_release) < 0)
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
                m_wrwset.post();
    }

    void write_lock()
    {
        if (m_wrstate.fetch_sub(1, std::memory_order_acquire) < 1)
            m_wrmtx.wait();
        int count = m_count.fetch_add(-INT_MAX, std::memory_order_acquire);
        if (count < INT_MAX)
        {
            int rdwake = m_rdwake.fetch_add(INT_MAX - count,
                                            std::memory_order_acquire);
            if (rdwake + INT_MAX - count)
                m_wrwset.wait();
        }
    }

    void write_unlock()
    {
        int count = m_count.fetch_add(INT_MAX, std::memory_order_release);
        if (count < 0)
            m_rdwset.post(-count);
        if (m_wrstate.fetch_add(1, std::memory_order_release) < 0)
            m_wrmtx.post();
    }

private:
    std::atomic<int> m_wrstate;
    std::atomic<int> m_count;
    std::atomic<int> m_rdwake;

    semaphore m_rdwset;
    semaphore m_wrwset;
    semaphore m_wrmtx;
};


#define ITERS 16
#define WRITERS 3
#define READERS 5
#define THREADS (WRITERS + READERS)


struct proxy_test
    : rl::test_suite<proxy_test, THREADS>
{
    ct_rw_fast_mutex m_rwmutex;
    VAR_T(int) m_shared;

    proxy_test() : m_shared(0) {}

    ~proxy_test() { RL_ASSERT(!VAR(m_shared)); }

    void thread(unsigned int tidx)
    {
        if (tidx < READERS)
        {
            // readers
            for (unsigned long i = 0; i < ITERS; ++i)
            {
                m_rwmutex.read_lock();
                int shared = VAR(m_shared);
                m_rwmutex.read_unlock();

                RL_ASSERT(shared > -1);
            }
        }
        else
        {
            // writers
            for (unsigned long i = 0; i < ITERS; ++i)
            {
                m_rwmutex.write_lock();
                ++VAR(m_shared);
                m_rwmutex.write_unlock();

                m_rwmutex.write_lock();
                --VAR(m_shared);
                m_rwmutex.write_unlock();
            }
        }
    }
};


int main()
{
    {
        rl::test_params p;

        p.iteration_count = 10000000;
        p.execution_depth_limit = 10000;
        //p.search_type = rl::sched_bound;
        //p.search_type = rl::fair_full_search_scheduler_type;
        //p.search_type = rl::fair_context_bound_scheduler_type;
        rl::simulate<proxy_test>(p);
    }

    return 0;
}
__________________________________


Can you run it?

Chris M. Thomasson

Feb 19, 2021, 6:37:13 PM
On 2/19/2021 3:32 PM, Chris M. Thomasson wrote:
> On 2/19/2021 3:01 PM, Bonita Montero wrote:
>>>>> My below powerful inventions of LW_Fast_RWLockX and Fast_RWLockX
>>>>> that are two powerful scalable RWLocks that are FIFO fair
>>
>>>> RW-locks are never fair.
>>
>>> There are implementations that can be starvation free.
>>
>> No, impossible. Readers can hold the lock as long as they
>> want and starve out writers. And writers could do the same.
>
> Check this out:
>
> https://vorbrodt.blog/2019/02/14/read-write-mutex/
>
> On older rwmutex I invented. Its simple, and is starvation free wrt
> reads and writes. Here is a simple Relacy test unit:
[...]

Notice how there are no loops in the rwmutex logic?

Bonita Montero

Feb 20, 2021, 2:50:47 AM
> Ummm... I am not sure what you mean here. ...

Readers can hold the mutex as long as they want.
And writers can do the opposite.
So they can prevent other writers or readers from locking the mutex.
So RW-locks are never starvation-free.

Chris M. Thomasson

Feb 20, 2021, 3:46:34 PM
On 2/19/2021 11:50 PM, Bonita Montero wrote:
>> Ummm... I am not sure what you mean here. ...
>
> Readers can hold the mutex as long as they want.
> And writers can do the opposite.

Indeed. But you snipped my relevant part.


> So they can prevent other writers or readers to lock the mutex.
> So RW-locks are never starvation-free.

Let me manually quote what I wrote. It's important because the type of
starvation you are talking about is very different from the usual meaning.

Chris M. Thomasson wrote:
_________________________________
[...]
The type of starvation I am talking about is when reader activity is so
frequent that writers just can't seem to get in. Even if the reader
critical sections are short. Every time a writer wants to get in,
several readers jump in line. It's all about reader or writer priority; a
lot of rwmutex impls have that flaw. Sure, a reader can hold the lock for
an hour, okay. Well, writers cannot get in for an hour. That's perfectly
normal. This is not the same type of starvation I am referring to...
_________________________________


Starvation on a rw mutex can occur when heavy reader activity actually
starves out writers. Some implementations can actually prioritize reads,
or writes. Some implementations have reader preference by design. So,
starvation can occur when frequent readers prevent writers from
acquiring write access. In other words, writers spend most of their time
trying and failing to acquire the mutex. This is why it's important to be
very careful when one is implementing a rwmutex. I linked you to an
algorithm I invented that is loop-free. It fairly alternates between
batches of readers and writers. So, readers can never starve out
writers, even during massive reader activity, and vice versa.

Chris M. Thomasson

Feb 20, 2021, 3:52:47 PM
Heavy writer activity starving out readers can happen, but it's rather
unusual. The general use pattern for a read-write mutex is a read-mostly
workload.

Bonita Montero

Feb 21, 2021, 11:16:47 AM
> I linked you to an algorithm I invented that is loop-free.
> It fairly alternates between batches of readers and writers.

It may alternate, but it's surely not fair.

Richard Damon

Feb 21, 2021, 1:15:29 PM
The problem is there isn't a uniform definition of 'fair', but fairness
requires making value decisions.

Reminds me of a discussion of 'fairness' in scheduling algorithms when I
was in college. This was in the days when you would keypunch your
program and submit the deck to the computer center and get your results
some time later.

The question is, which of these algorithms is 'fairer':

60 people come up and submit their jobs at basically the same time. Each
job is going to take 1 minute of CPU time to run. Which of these
scheduling algorithms is 'fairer'?

Method 1, First job in line run to completion, then the second job is
run to completion, then the 3rd and so on. So the first person gets his
results after a minute, next a minute later, and so on so the last gets
their results 60 minutes later.

Method 2, All the jobs are loaded into memory at once, and time sliced
between them, each getting 1 second of time, then the next, and so on.
Every job will finish at the end of the hour.


Many people will label the second method as 'fairer' as everyone gets
the same pain, but no one in the second example was any better off than
in the first, and almost everyone was worse off, so by most efficiency
measures it is clearly worse.
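
For reference, the arithmetic behind that comparison (assuming all 60
jobs arrive at time zero and slicing overhead is negligible), in LaTeX:

\bar{t}_{\text{run-to-completion}} = \frac{1}{60}\sum_{k=1}^{60} k = 30.5 \text{ min},
\qquad \bar{t}_{\text{time-sliced}} \approx 60 \text{ min}

So the first method roughly halves the average response time, and no
individual job finishes later than it would under the second method.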

Now, if all the tasks don't take exactly the same length to run, or
aren't all submitted at once, some of the value arguments of what should
be fair come into play, but 'fairness' can be hard to define.

Bonita Montero

Feb 21, 2021, 1:59:01 PM
> 60 People come up and submit their jobs at basically the same time.
> Each Job is going to take 1 minute of CPU time to run. Which of these
> scheduling algorithms is 'fairer'
> Method 1, First job in line run to completion, then the second job is
> run to completion, then the 3rd and so on. So the first person gets his
> results after a minute, next a minute later, and so on so the last gets
> their results 60 minutes later.
> Method 2, All the jobs are loaded into memory at once, and time sliced
> between them, each getting 1 second of time, then the next, and so on.
> Every job will finish at the end of the hour.
> Many people will label the second method as 'fairer' as everyone gets
> the same pain, but no one in the second example was any better off then
> in the first, and almost everyone worse, so by most efficiency measures
> it is clearly worse.

You can't compare that to the issue being discussed, because there's no
opportunity for one task to capture all the CPU time.

Richard Damon

Feb 21, 2021, 3:58:35 PM
The key point isn't what resource was being managed, but how we
really define 'fair'. The key part about the example was that many
people thought that the second method was fairer, because there was much
less difference in response times across all clients, even though it was
at the cost of making almost everyone wait longer than they needed to in
the first. The first method has an average response time of half the
second, and no one got worse performance. Only by a metric on variation
in response time can the second be shown to be superior, but some think
that makes it fairer.

To define 'fairness' we need to define some metric to measure it.
Normally we can't get 'perfect' fairness except in very special
conditions. Normally we define a system as 'fair' if every client can
get some reasonable amount of the resource (even if it isn't a 'fair
share'). Tighter definitions of 'fairness' might try to limit the
inequities in allocation (and if the clients are identical in needs, we
need ways to 'fairly' allocate needs to measure inequality). Generally,
attempts to make a system 'fairer' can often impact the efficiency of
the system. In my example, to reduce the variance in the wait times, the
average wait time doubled, and perhaps by increasing the number of
context switches we might have actually increased the total time to run
all the jobs.

My key point is that the first step is we need to define what we want to
meet as 'fair', and then we can decide if a given strategy will meet
that definition (and maybe how well).

Thus, if you want to call something 'unfair', you need to define what
you mean by fair. It is quite possible to define a RW-lock that will see
to it that as long as no locker fails to release its lock in a
reasonable time period, all requests WILL get met eventually. It just
requires that at some point after a write request gets queued, the system
stops granting new read requests (so the writer will eventually get
in when all current readers end), and that a somewhat fair method (like
oldest first) is used for granting requests when the system is free. Yes,
perhaps the simplest RW-lock will not be able to do this, but then the
issue isn't about RW-locks in general, but about that style of
implementation. If you are using some other definition of 'fair', you
need to define it.

Chris M. Thomasson

Feb 21, 2021, 5:35:25 PM
I agree. Fwiw, my rwmutex does not allow readers to starve out writers
and vice versa. So, if a writer requests the lock while there are a shit
load of concurrent readers, it will wait until all of the readers in the
current "batch" are completed and will automatically own the lock for
write access after that. If a batch of readers comes in while the writer
is doing its thing, they will be released all at once and automatically
own reader access once the writer is finished. This way, readers cannot
starve out writers and vice versa. Now, if one does not desire that type
of alternating, basically a classic bakery-algorithm-like behavior, then
do NOT use my rwmutex.

https://vorbrodt.blog/2019/02/14/read-write-mutex

I like how he gave me proper credit. Nice. :^)

Actually, my rwmutex is pretty short and sweet with no loops.

Chris M. Thomasson

Feb 22, 2021, 12:47:58 AM
Why not?

Bonita Montero

Feb 22, 2021, 12:51:36 AM
>> You can't compare that to the discussed issue because there's no
>> opportunity to capture all the CPU-time by one task.

> Why not?

Not how he defined it.

Bonita Montero

Feb 22, 2021, 12:52:52 AM
> The key point isn't what resource was being managed, but how do we
> really define 'fair'. ...

But with an example that doesn't fit with the discussion.
Chris does know what I mean, you don't.

Bonita Montero

Feb 22, 2021, 12:54:18 AM
> I agree. Fwiw, my rwmutex does not allow readers to starve out writers
> and vise versa. ...

Building such an RW-mutex is impossible. You can't prevent a writer or
reader from holding the mutex as long as it wants.

Chris M. Thomasson

Feb 22, 2021, 1:00:30 AM
True, you have again snipped out the relevant issue of classic
starvation I posted, where frequent heavy read activity can starve out
writers, and vice versa. You are familiar with reader/writer priority on
a rwmutex, right? My impl alternates them in batches.

Chris M. Thomasson

Feb 22, 2021, 1:01:23 AM
Well, let's say five threads have read access... There is nothing
stopping them from using all CPUs!

Chris M. Thomasson

Feb 22, 2021, 1:10:43 AM
I do know what you mean; it's just not the classic way of thinking about
starvation. You are 100% correct that a batch of readers, or a single
writer, can hold the lock for hours. It can happen; well, it's not
advised, but it still can occur.

Bonita Montero

Feb 22, 2021, 1:41:04 AM
> True, you have again snipped out the relevant issue of classic
> starvation I posted where frequent heavy read activity can stave
> out writers, and vise versa. You are familiar with reader/writer
> priority on rwmutex right? My impl alternates them in batches.

That doesn't make it fair.

Bonita Montero

Feb 22, 2021, 3:21:09 AM
> You are familiar with reader/writer priority on
> rwmutex right? My impl alternates them in batches.

Chris M. Thomasson

Feb 22, 2021, 5:54:31 PM
My version of "fairness", or being "starvation free", is when readers
cannot starve out writers, and vise versa. Oh well. :^)

Bonita Montero

Feb 23, 2021, 2:38:48 AM
> My version of "fairness", or being "starvation free", is when readers
> cannot starve out writers, and vise versa. Oh well. :^)

That's impossible to implement.
Readers or writers can hold the mutex as long as they want.

Chris M. Thomasson

Feb 23, 2021, 3:25:35 AM
We just have radically different versions of what it means to be
starvation free. :^)

Bonita Montero

Feb 23, 2021, 2:18:01 PM
>>> My version of "fairness", or being "starvation free", is when readers
>>> cannot starve out writers, and vise versa. Oh well. :^)

>> That's impossible to implement.
>> Readers or writers can hold the mutex as long as they want.

> We just have radically different versions of what it means to be
> starvation free. :^)

The fairness of a lock doesn't depend on the implementation of the
lock but on how it is used.

Chris M. Thomasson

Feb 23, 2021, 2:29:42 PM
It does depend on the implementation whether frequent reads can starve
out writers.

Richard Damon

Feb 23, 2021, 4:51:28 PM
And the standard definition of a 'fair' locking system is one that is
fair when the clients follow the listed rules, like a maximum hold time
for a lock.

The only way to handle unruly clients who hold their locks too long
is to have OS support to kill those clients and have it unlock the locks
they held on to. (It could, rather than kill, just deny further access,
depending on the resource.)

Chris M. Thomasson

Feb 23, 2021, 5:13:59 PM
On 2/23/2021 11:17 AM, Bonita Montero wrote:
Are you familiar with bakery algorithms?

Chris M. Thomasson

Feb 23, 2021, 5:14:19 PM
> Are you familiar with bakery algorithms?

bakery algorithms?

Bonita Montero

Feb 24, 2021, 12:20:32 PM
> It does depend on the implementation if frequent reads can starve out
> writers.

But whether this happens is a matter of the lock's behaviour.

Bonita Montero

Feb 24, 2021, 12:22:11 PM
> The only way to handle unruly clients is who hold their locks too long
> is to have OS support to kill those clients and have it unlock the locks
> it held on to. (It could not kill but just deny further access depending
> on the resource).

We're talking about locks which protect shared data structures. In this
case it would be stupid to kill a process or thread holding such a lock.

Richard Damon

Feb 24, 2021, 1:29:03 PM
A lock that works with the OS, so that the OS knows what locks the
process holds, lets the OS kill a 'bad' process and force the release of
the lock.

The only way to handle 'bad' processes is for there to be some
mechanism to deny that process access to the resource it has taken
the lock on, so the resource can be given to someone else. That
capability is typically reserved to the OS.

Yes, if you abort a process in the middle of its operation, you may need
to do some 'cleanup' on the resource, but if the process isn't releasing
the resource in the required time, can you really trust that it is doing
its job right?

Chris M. Thomasson

Feb 24, 2021, 8:16:35 PM
Are you familiar with robust mutexes? They are designed to allow one to
repair the shared data if a process dies while holding the lock.

Bonita Montero

Feb 25, 2021, 1:08:19 AM
> A lock that works with the OS to so the OS knows what locks the process
> holds can kill a 'bad' process and force the release of the lock.

That's impossible because what the lock guards is in an inconsistent
state at that time.

Chris M. Thomasson

Feb 25, 2021, 2:39:35 AM
Think of a "master" process, than can kill a process while its holding a
process-wide mutex. This can be handled and corrected, the state can be
repaired into a consistent state. Its tricky, yet can be done. Wrt
Windows think of the WAIT_ABANDONDED state. POSIX has it with EOWNERDEAD.

Chris M. Thomasson

Feb 25, 2021, 2:43:14 AM
On 2/24/2021 10:08 PM, Bonita Montero wrote:
Wrt WAIT_ABANDONED, read the docs:

https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject
_____________
WAIT_ABANDONED:

The specified object is a mutex object that was not released by the
thread that owned the mutex object before the owning thread terminated.
Ownership of the mutex object is granted to the calling thread and the
mutex state is set to nonsignaled.
If the mutex was protecting persistent state information, you should
check it for consistency.
_____________

POSIX has EOWNERDEAD:

https://man7.org/linux/man-pages/man3/pthread_mutex_consistent.3.html
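
To make that concrete, here is a minimal C++ sketch of the POSIX
robust-mutex recovery path (illustrative only; repair_shared_state()
stands in for whatever application-specific consistency fix is needed):

__________________________________
#include <errno.h>
#include <pthread.h>

// Create a robust, process-shared mutex inside a shared-memory region
// (shared_mem must point at suitably aligned shared memory).
pthread_mutex_t* make_robust_mutex(void* shared_mem)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED); // cross-process
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_t* m = static_cast<pthread_mutex_t*>(shared_mem);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return m;
}

void lock_and_maybe_repair(pthread_mutex_t* m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD)
    {
        // The previous owner died while holding the lock: we now own it,
        // but the protected data may be half-updated.
        // repair_shared_state();        // hypothetical application-specific fix
        pthread_mutex_consistent(m);     // mark the mutex usable again
    }
    // ... critical section ...
    pthread_mutex_unlock(m);
}
__________________________________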

Bonita Montero

Feb 25, 2021, 3:49:29 AM
> Think of a "master" process, than can kill a process while its holding a
> process-wide mutex. This can be handled and corrected, the state can be
> repaired into a consistent state. Its tricky, yet can be done. Wrt
> Windows think of the WAIT_ABANDONDED state. POSIX has it with EOWNERDEAD.

If you have shared memory that was being modified while the lock was
held, there's no way to fix it.

Chris M. Thomasson

Feb 25, 2021, 4:05:22 AM
Have you ever had to encounter and deal with EOWNERDEAD or
WAIT_ABANDONED?

Chris M. Thomasson

Feb 25, 2021, 4:49:11 AM
The way a rwmutex is actually used should be read-heavy, with writes
not so frequent. However, frequent sustained reads should not have the
ability to starve out writers; lots of rwmutex impls allow just that.
Some even have so-called reader/writer priorities. Bakery algorithms
cannot suffer from these conditions... They just go in order.

Bonita Montero

Feb 25, 2021, 5:43:09 AM
>> If you have shared memory modified while holding the lock there's
>> nothing to fix.

> have you ever had to encounter and deal with EOWNERDEAD, or
> WAIT_ABANDONDED ?

Boy, you're stupid. If you're modifying memory while holding a lock
and the lock is deleted meanwhile, WAIT_ABANDONED or whatever doesn't
help.

Chris M. Thomasson

unread,
Feb 25, 2021, 6:09:17 AM2/25/21
to
You have never had to clean up after a process dies while holding a mutex?

Chris M. Thomasson

Feb 25, 2021, 6:10:27 AM
On 2/25/2021 3:09 AM, Chris M. Thomasson wrote:
> On 2/25/2021 2:43 AM, Bonita Montero wrote:
>>>> If you have shared memory modified while holding the lock there's
>>>> nothing to fix.
>>
>>> have you ever had to encounter and deal with EOWNERDEAD, or
>>> WAIT_ABANDONDED ?
>>
>> Boy, you're stupid. If you're modifying memory while holding a lock
>> and the lock is deleted meanwhile, WAIT_ABANDONNED or whatever doesn't
>> help.

I am not referring to deleting a mutex; I am talking about a process
crashing in the middle of a critical section.