Hello,
I think I have just found a solution to the buffer overflow problem for memory safety, and I will explain it now.
As you may have noticed, I have implemented Getmem_aligned() and Freemem_aligned() for Delphi and Freepascal; here they are:
https://sites.google.com/site/scalable68/getmem_aligned-for-delphi-and-freepascal
You can get the idea from the source code.
Here is what I am doing (in modern Object Pascal, for the Delphi and Freepascal compilers):
================================================
procedure getmem_aligned(alignment:cardinal;var ptr:pointer;size:cardinal);
var raw,aligned:pointer;
    a:NativeUInt;
begin
// alignment must be a power of two
a:=alignment;
raw := AllocMem(size + (2*alignment) + sizeof(pointer));
// skip one pointer-sized slot, then round up to a multiple of the alignment
aligned := pointer((NativeUInt(raw) + NativeUInt(sizeof(pointer)) + a - 1) and not (a - 1));
// store the address returned by AllocMem just below the aligned block
PPointer(NativeUInt(aligned) - NativeUInt(sizeof(pointer)))^ := raw;
ptr:=aligned;
end;
procedure freemem_aligned(ptr:pointer);
begin
// read back the address returned by AllocMem and free it
freemem(PPointer(NativeUInt(ptr) - NativeUInt(sizeof(pointer)))^);
end;
========================================================
As you can see, I am reserving extra memory the size of a "pointer", like this:
AllocMem(size + (2*alignment) + sizeof(pointer));
The idea for buffer overflows is to reserve one more pointer-sized field in the AllocMem() call, which you access the same way I access the real pointer in Freemem_aligned() above, like this:
AllocMem(size + (2*alignment) + sizeof(pointer) + sizeof(pointer));
You then have two fields: one for the size of the reserved memory and one for the address of the real pointer. After that,
you write a CopyMemory() that works with the Pointer, PWideChar and PChar types of modern Object Pascal for the Delphi and Freepascal compilers, and I think this is easy for me to do. The new CopyMemory() will raise an exception when a copy would go past the reserved size, so it catches buffer overflows for blocks allocated this way in Delphi and
Freepascal. You can use JclDebug, madExcept or EurekaLog for Delphi to print the line of source code where the buffer overflow exception happened. JclDebug is free, and you can get it from here:
https://wiki.delphi-jedi.org/wiki/JCL_Help:JclDebug.pas
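Here is a minimal sketch of the idea above, under stated assumptions. The names TBlockHeader, GetMemChecked, FreeMemChecked, CheckedCopy and EBufferOverflow are hypothetical and are not part of my library, and I leave out the alignment step to keep it short. The header in front of the block holds the two fields (the reserved size and the real pointer), and the copy routine checks the size before it writes.

```pascal
program CheckedCopyDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  EBufferOverflow = class(Exception);

  // Hypothetical header stored just below the pointer handed to the caller
  PBlockHeader = ^TBlockHeader;
  TBlockHeader = record
    Size: NativeInt;  // usable size of the block in bytes
    Raw: Pointer;     // address returned by AllocMem
  end;

function GetMemChecked(Size: NativeInt): Pointer;
var
  Raw: Pointer;
begin
  Raw := AllocMem(SizeOf(TBlockHeader) + Size);
  PBlockHeader(Raw)^.Size := Size;
  PBlockHeader(Raw)^.Raw := Raw;
  Result := PByte(Raw) + SizeOf(TBlockHeader);
end;

function HeaderOf(P: Pointer): PBlockHeader;
begin
  Result := PBlockHeader(PByte(P) - SizeOf(TBlockHeader));
end;

procedure FreeMemChecked(P: Pointer);
begin
  FreeMem(HeaderOf(P)^.Raw);
end;

// Copies Count bytes into Dest; Dest must come from GetMemChecked
procedure CheckedCopy(Dest, Source: Pointer; Count: NativeInt);
begin
  if (Count < 0) or (Count > HeaderOf(Dest)^.Size) then
    raise EBufferOverflow.Create('Buffer overflow detected in CheckedCopy');
  Move(Source^, Dest^, Count);
end;

var
  Buf: Pointer;
  Src: array[0..15] of Byte;
begin
  FillChar(Src, SizeOf(Src), $AB);
  Buf := GetMemChecked(8);
  try
    CheckedCopy(Buf, @Src, 8);      // fits in the 8-byte block
    try
      CheckedCopy(Buf, @Src, 16);   // too large: raises before writing
    except
      on E: EBufferOverflow do
        WriteLn(E.Message);
    end;
  finally
    FreeMemChecked(Buf);
  end;
end.
```

Note that this check only covers copies that go through CheckedCopy and only destinations that were allocated with the header; a direct write through the pointer is not checked.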
For more context, read my previous thoughts about "Fearless Security: Memory safety":
I have just read the following webpage about "Fearless Security: Memory safety":
https://hacks.mozilla.org/2019/01/fearless-security-memory-safety/
Here are the memory safety problems:
1- Misusing Free (use-after-free, double free)
I have solved this in Delphi and Freepascal by inventing a "scalable" reference counting with efficient support for weak references. Read about it below.
2- Uninitialized variables
The Delphi and Freepascal compilers can warn about this.
3- Dereferencing Null pointers
I have solved this in Delphi and Freepascal by inventing a "scalable" reference counting with efficient support for weak references. Read about it below.
4- Buffer overflow and underflow
In Delphi this can be detected by using madExcept; read about it here:
http://help.madshi.net/DebugMm.htm
You can buy it from here:
http://www.madshi.net/
About race conditions, deadlocks and more, read my following thoughts:
Let me restate more precisely how Rust detects race conditions, so read carefully:
You can think of the borrow checker of Rust as a validator for a locking system: immutable references are shared read locks and mutable references are exclusive write locks. Under this mental model, accessing data via two independent write locks is not a safe thing to do, and modifying data via a write lock while there are readers alive is not safe either.
As you can see, "mutable" references in Rust follow the Read-Write Lock pattern. I think this is not good, because it rules out the finer-grained parallelism that lets us run writes in "parallel" and gain performance from parallelizing them.
Read more about Rust and Delphi and my inventions..
I think the spirit of Rust is like the spirit of Ada: both are designed for very high standards of safety. "But" I don't think we have to fear the race conditions that Rust prevents, because I think race conditions are not so difficult to avoid when you are a knowledgeable programmer in parallel programming. Now for the rest of the safety guarantees of Rust. There remains the problem of deadlocks, and I think that Rust does not solve it, but I have provided you with an enhanced DelphiConcurrent library for Delphi and Freepascal that detects deadlocks. There are also the memory safety guarantees of Rust; here they are:
1- No Null Pointer Dereferences
2- No Dangling Pointers
3- No Buffer Overruns
But notice that I have solved number 1 and number 2 by inventing my
scalable reference counting with efficient support for weak references
for Delphi and Freepascal (read about it below). For number 3, read my following thoughts:
More about research and software development..
I have just looked at the following new video:
Why is coding so hard...
https://www.youtube.com/watch?v=TAAXwrgd1U8
I understand this video, but I have to explain my work:
I am not like the techlead in the video above, because I am also an "inventor" who has invented many scalable algorithms and their implementations. I am also inventing effective abstractions. Here is an example:
Read the following by the senior research scientist Dave Dice:
Preemption tolerant MCS locks
https://blogs.oracle.com/dave/preemption-tolerant-mcs-locks
As you can see, he is trying to invent a new lock that is preemption tolerant, but his lock lacks some important characteristics. This is why I have just invented a new adaptive Fast Mutex that I think is much better. I think mine is the "best", and I think you will not find it anywhere. My new Fast Mutex has the following characteristics:
1- Starvation-free
2- Good fairness
3- It keeps cache coherence traffic very low
4- Very good fast path performance (it has the same performance as the
scalable MCS lock when there is contention.)
5- And it has a decent preemption tolerance.
This is how I am an "inventor". I have also invented other scalable algorithms, such as a scalable reference counting with efficient support for weak references, a fully scalable Threadpool, a fully scalable FIFO queue, and other scalable algorithms and their implementations, and I think I will sell some of them to Microsoft,
Google, Embarcadero or similar software companies.
Read my following writing to know me better:
More about computing and parallel computing..
The important guarantees of Memory Safety in Rust are:
1- No Null Pointer Dereferences
2- No Dangling Pointers
3- No Buffer Overruns
I think I have solved Null Pointer Dereferences, Dangling Pointers and memory leaks for Delphi and Freepascal by inventing my "scalable" reference counting with efficient support for weak references, which I have implemented in Delphi and Freepascal (read about it below). Reference counting in Rust and C++ is "not" scalable.
About (3) above, Buffer Overruns, read here about Delphi and Freepascal:
What's a buffer overflow and how to avoid it in Delphi?
Read my thoughts about it above.
About deadlocks and race conditions in Delphi and Freepascal:
I have ported DelphiConcurrent to Freepascal, and I have
also extended both with support for my scalable RWLocks for Windows and Linux, for my scalable lock called MLock for Windows and Linux, and for a Mutex for Windows and Linux. Please look inside the DelphiConcurrent.pas and FreepascalConcurrent.pas files inside the zip file to understand more.
You can download DelphiConcurrent and FreepascalConcurrent for Delphi and Freepascal from:
https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent
DelphiConcurrent and FreepascalConcurrent by Moualek Adlene is a new way to build Delphi applications that involve parallel code based on threads, such as application servers. DelphiConcurrent provides programmers with the internal mechanisms to write safer multi-threaded code while taking special care of performance and genericity.
In concurrent applications a DEADLOCK may occur when two or more threads try to lock two or more consecutive shared resources in a different order. With DelphiConcurrent and FreepascalConcurrent, a DEADLOCK is detected and automatically skipped before it occurs, and the programmer gets an explicit exception describing the multi-thread problem instead of a blocking DEADLOCK that freezes the application with no output log (and perhaps also the linked client sessions, if we talk about an application server).
Amine Moulay Ramdane has extended them with support for his scalable RWLocks for Windows and Linux, for his scalable lock called MLock for Windows and Linux, and for a Mutex for Windows and Linux. Please look inside the DelphiConcurrent.pas and FreepascalConcurrent.pas files to
understand more.
And please read the html file inside the zip file to learn more about how to use it.
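To show how a lock-order problem can become an exception instead of a freeze, here is a small sketch. It is only an illustration under my own assumptions: TRankedLock, ELockOrder and HighestHeld are hypothetical names, and this is not the code of DelphiConcurrent. Every lock gets a fixed rank, and a thread that tries to take a lock whose rank is not above the ranks it already holds gets an exception before it blocks.

```pascal
program LockOrderDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, SyncObjs;

type
  ELockOrder = class(Exception);

  // Hypothetical lock with a fixed rank; not a DelphiConcurrent class
  TRankedLock = class
  private
    FRank: Integer;
    FPrevRank: Integer;
    FLock: TCriticalSection;
  public
    constructor Create(Rank: Integer);
    destructor Destroy; override;
    procedure Acquire;
    procedure Release;
  end;

threadvar
  HighestHeld: Integer; // highest rank held by the current thread, 0 if none

constructor TRankedLock.Create(Rank: Integer);
begin
  inherited Create;
  FRank := Rank; // ranks must be >= 1
  FLock := TCriticalSection.Create;
end;

destructor TRankedLock.Destroy;
begin
  FLock.Free;
  inherited Destroy;
end;

procedure TRankedLock.Acquire;
begin
  // Taking a lock whose rank is not above everything already held could
  // deadlock with a thread that takes the same locks in the other order
  if FRank <= HighestHeld then
    raise ELockOrder.Create('Lock order violation: possible deadlock');
  FLock.Acquire;
  FPrevRank := HighestHeld;
  HighestHeld := FRank;
end;

procedure TRankedLock.Release;
begin
  HighestHeld := FPrevRank; // assumes locks are released in reverse order
  FLock.Release;
end;

var
  L1, L2: TRankedLock;
begin
  L1 := TRankedLock.Create(1);
  L2 := TRankedLock.Create(2);
  try
    L1.Acquire; L2.Acquire;  // increasing rank: allowed
    L2.Release; L1.Release;

    L2.Acquire;
    try
      L1.Acquire;            // decreasing rank: raises instead of blocking
    except
      on E: ELockOrder do
        WriteLn(E.Message);
    end;
    L2.Release;
  finally
    L2.Free;
    L1.Free;
  end;
end.
```

The sketch reports an inconsistent order even when no other thread is involved, which is stricter than detecting an actual deadlock, but it gives the programmer an exception with a message instead of a frozen application.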
About race conditions now:
My scalable Adder is here..
As you may have noticed, I have previously posted my modified versions of DelphiConcurrent and FreepascalConcurrent to deal with deadlocks in parallel programs.
But I have also just read the following about how to avoid race conditions in parallel programming in most cases.
Here it is:
https://vitaliburkov.wordpress.com/2011/10/28/parallel-programming-with-delphi-part-ii-resolving-race-conditions/
This is why I have invented my following powerful scalable Adder to help you do the same as the above. Please take a look at its source code to understand more; here it is:
https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal
Other than that, here is something about the composability of lock-based systems:
Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable:
“Locks and condition variables do not support modular programming,” reads one typically brazen claim, “building large programs by gluing together smaller programs[:] locks make this impossible.” The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.
There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.
Second (and perhaps counterintuitively), one can achieve concurrency and
composability by having no locks whatsoever. In this case, there must be
no global subsystem state—subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized.
Read more here:
https://queue.acm.org/detail.cfm?id=1454462
And about the Message Passing and Shared Memory process communication models:
An advantage of the shared memory model is that communication is faster than in the message passing model on the same machine.
Read the following to see why:
Why did Windows NT move away from the microkernel?
"The main reason that Windows NT became a hybrid kernel is speed. A microkernel-based system puts only the bare minimum system components in the kernel and runs the rest of them as user mode processes, known as servers. A form of inter-process communication (IPC), usually message passing, is used for communication between servers and the kernel.
Microkernel-based systems are more stable than others; if a server crashes, it can be restarted without affecting the entire system, which couldn't be done if every system component was part of the kernel. However, because of the overhead incurred by IPC and context-switching, microkernels are slower than traditional kernels. Due to the performance costs of a microkernel, Microsoft decided to keep the structure of a microkernel, but run the system components in kernel space. Starting in Windows Vista, some drivers are also run in user mode."
More about message passing..
"One problem that plagues microkernel implementations is relatively poor performance. The message-passing layer that connects
different operating system components introduces an extra layer of
machine instructions. The machine instruction overhead introduced
by the message-passing subsystem manifests itself as additional
execution time. In a monolithic system, if a kernel component needs
to talk to another component, it can make direct function calls
instead of going through a third party."
However, the shared memory model may create problems such as synchronization and memory protection that need to be addressed.
Message passing’s major flaw is the inversion of control–it is a moral equivalent of gotos in un-structured programming (it’s about time somebody said that message passing is considered harmful).
Also, some research shows that the total effort to write an MPI application is significantly higher than the effort required to write a shared-memory version of it.
And more about my scalable reference counting with efficient support for weak references:
My invention, scalable reference counting with efficient support for weak references, version 1.37, is here..
I have just updated my scalable reference counting with
efficient support for weak references to version 1.37. I have added TAMInterfacedPersistent, which is a scalable reference counted version of TInterfacedPersistent,
and I now think it is complete and powerful.
I did this because I have just read the following web page:
https://www.codeproject.com/Articles/1252175/Fixing-Delphis-Interface-Limitations
But I don't agree with what the author of that web page writes, because I think you have to understand the "spirit" of Delphi. Here is why:
A component is supposed to be owned and destroyed by something else, "typically" a form ("typically" means in "most" cases, and this is the most important thing to understand). In that scenario, the reference count is not used.
If you pass a component as an interface reference, it would be very unfortunate if it were destroyed when the method returns.
Therefore, reference counting in TComponent has been removed.
This is also why I have added TAMInterfacedPersistent to my invention.
To use scalable reference counting with Delphi and FreePascal, replace TInterfacedObject with my TAMInterfacedObject, which is the scalable reference counted version, and replace TInterfacedPersistent with my TAMInterfacedPersistent, which is the scalable reference counted version. You will find both TAMInterfacedObject and TAMInterfacedPersistent
inside the AMInterfacedObject.pas file. To learn how to use weak references, take a look at the included demo called example.dpr and at the tutorial about weak references inside my zip file. To learn how to use delegation, take a look at the included demo called test_delegation.pas and at the tutorial about delegation inside my zip file.
I think my scalable reference counting with efficient support for
weak references is stable and fast. It works on both Windows and Linux, and it scales on multicore and NUMA systems. You will not find it in C++ or Rust, and I don't think you will find it anywhere. This invention of mine solves the problem of dangling pointers and the problem of memory leaks, and it is "scalable".
Please also read the readme file inside the zip file, which I have just
extended to help you understand more.
You can download my new scalable reference counting with efficient support for weak references version 1.37 from:
https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references
And now I will talk about data dependency and parallel loops..
For a loop to be parallelized, every iteration must be independent of the others. One practical check is to execute the loop
in the direction of the incremented index and then in the direction of the decremented index and verify that the results are the same. Different results prove a dependency; identical results are a good indication, but not a proof, that there is none. A data dependency happens if memory is modified: a loop has a data dependency if an iteration writes a variable that is read or written in another iteration of the loop. There is no data dependency if only one iteration reads or writes a variable, or if many iterations read
the same variable without modifying it. These are the "general" rules.
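The check described above can be sketched like this. The names TBody, IndependentBody, DependentBody and SameBothWays are only for illustration. One loop body has independent iterations, the other reads a value written by the previous iteration, and each is run once with an incrementing index and once with a decrementing index.

```pascal
program LoopOrderCheck;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

const
  N = 8;

type
  TVec = array[0..N] of Integer;
  TBody = procedure(var A: TVec; const B: TVec; I: Integer);

// Independent iterations: A[I] depends only on B[I]
procedure IndependentBody(var A: TVec; const B: TVec; I: Integer);
begin
  A[I] := 2 * B[I];
end;

// Loop-carried dependency: A[I] reads A[I-1], written by another iteration
procedure DependentBody(var A: TVec; const B: TVec; I: Integer);
begin
  A[I] := A[I - 1] + B[I];
end;

// Runs the body in both directions and compares the results
function SameBothWays(Body: TBody): Boolean;
var
  Up, Down, B: TVec;
  I: Integer;
begin
  for I := 0 to N do
  begin
    B[I] := I + 1;
    Up[I] := 0;
    Down[I] := 0;
  end;
  for I := 1 to N do
    Body(Up, B, I);        // incrementing index
  for I := N downto 1 do
    Body(Down, B, I);      // decrementing index
  Result := True;
  for I := 0 to N do
    if Up[I] <> Down[I] then
      Result := False;
end;

begin
  WriteLn('independent loop gives the same result: ', SameBothWays(IndependentBody));
  WriteLn('dependent loop gives the same result:   ', SameBothWays(DependentBody));
end.
```

The first loop passes the check and the second one fails it. Keep in mind that a loop like a sum can pass this check and still have a dependency on the accumulator, so a passing check is a hint and not a guarantee.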
What remains is to know how to construct the parallel for loop when there is an induction variable or a reduction operation. I will give an example of each:
Suppose we have the following (the code looks like Algol or modern Object Pascal):
IND:=0;
For I:=1 to N
Do
Begin
IND := IND + I;
A[I]:=B[IND];
End;
Since IND is an induction variable, each iteration depends on the value left by the previous one. After iteration I its value is 1+2+...+I = (I*(I+1))/2, so
to parallelize the loop you compute IND directly from I and keep it private to each iteration:
For I:=1 to N
Do
Begin
IND:=(I*(I+1)) div 2;
A[I]:=B[IND];
End;
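As a small check of the closed form, here is a sketch that assumes the induction variable grows by I in each iteration, so that after iteration I it equals (I*(I+1)) div 2. The name ClosedFormIndex is only for illustration. The sequential loop carries IND from one iteration to the next, while the second loop computes the index from I alone and runs in reverse order to show that the order no longer matters.

```pascal
program InductionDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

const
  N = 10;
  M = (N * (N + 1)) div 2;

// Value of the induction variable after iteration I, computed from I alone
function ClosedFormIndex(I: Integer): Integer;
begin
  Result := (I * (I + 1)) div 2;
end;

var
  B: array[1..M] of Integer;
  ASeq, APar: array[1..N] of Integer;
  I, IND: Integer;
  Same: Boolean;
begin
  for I := 1 to M do
    B[I] := 3 * I;

  // Sequential version: IND is carried from one iteration to the next
  IND := 0;
  for I := 1 to N do
  begin
    IND := IND + I;
    ASeq[I] := B[IND];
  end;

  // Independent iterations: run in reverse to show that order is irrelevant
  for I := N downto 1 do
    APar[I] := B[ClosedFormIndex(I)];

  Same := True;
  for I := 1 to N do
    if ASeq[I] <> APar[I] then
      Same := False;
  WriteLn('closed form matches the sequential loop: ', Same);
end.
```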
Now for the reduction operation example. My invention, a Threadpool with priorities that scales very well (read about it below), supports a Parallel For that scales very well and that supports "grainsize". The grainsize can be used in ParallelFor() with a reduction operation, and my following powerful scalable Adder is also used in this scenario; here it is:
https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal
Here is the example with a reduction operation in modern Object Pascal:
TOTAL:=0.0;
For I := 1 to N
Do
Begin
TOTAL:=TOTAL+A[I];
End;
With my powerful scalable Adder and with my ParallelFor() that scales very well, you parallelize the above like this:
procedure test1(j:integer;ptr:pointer);
begin
t.add(A[J]); // "t" is my scalable Adder object
end;
// Let's suppose that N is 100000
// In the following, 10000 is the grainsize
obj.ParallelFor(1,N,test1,10000,pointer(0));
TOTAL:=T.get();
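If you want to see the same reduction pattern without my libraries, here is a sketch that uses plain TThread instead of my ParallelFor() and my scalable Adder. TSumThread and the one-thread-per-chunk layout are assumptions made only for this illustration. Each thread sums one chunk of grainsize iterations into its own private partial sum, and the partial sums are combined at the end, so no two threads write the same variable.

```pascal
program ReductionDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  {$IFDEF UNIX}cthreads,{$ENDIF}
  Classes;

const
  N = 100000;
  GrainSize = 10000;
  NumChunks = (N + GrainSize - 1) div GrainSize;

var
  A: array[1..N] of Double;

type
  // Hypothetical worker: sums A[Lo..Hi] into its own partial result
  TSumThread = class(TThread)
  private
    FLo, FHi: Integer;
  protected
    procedure Execute; override;
  public
    Partial: Double;
    constructor Create(Lo, Hi: Integer);
  end;

constructor TSumThread.Create(Lo, Hi: Integer);
begin
  FLo := Lo;   // set the range before the thread can start
  FHi := Hi;
  inherited Create(False);
end;

procedure TSumThread.Execute;
var
  I: Integer;
  S: Double;
begin
  S := 0.0;
  for I := FLo to FHi do
    S := S + A[I];
  Partial := S;
end;

var
  Threads: array[0..NumChunks - 1] of TSumThread;
  K, I, Hi: Integer;
  Total: Double;
begin
  for I := 1 to N do
    A[I] := 1.0;

  for K := 0 to NumChunks - 1 do
  begin
    Hi := (K + 1) * GrainSize;
    if Hi > N then
      Hi := N;
    Threads[K] := TSumThread.Create(K * GrainSize + 1, Hi);
  end;

  // Combine the private partial sums once every chunk is finished
  Total := 0.0;
  for K := 0 to NumChunks - 1 do
  begin
    Threads[K].WaitFor;
    Total := Total + Threads[K].Partial;
    Threads[K].Free;
  end;
  WriteLn('TOTAL = ', Total:0:1);
end.
```

Creating one thread per chunk is only acceptable for a demonstration like this one; a threadpool avoids the cost of creating and destroying the threads.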
Read the following to understand how to use the grainsize of my Parallel For that scales well:
About my ParallelFor() that scales very well and that uses my efficient Threadpool that scales very well:
With ParallelFor() you have to:
1- Ensure Sufficient Work
Each iteration of a loop involves a certain amount of work,
so you have to ensure that there is a sufficient amount of work;
read below about the "grainsize" that I have implemented.
2- In OpenMP we have that:
Static and Dynamic Scheduling
One basic characteristic of a loop schedule is whether it is static or dynamic:
• In a static schedule, the choice of which thread performs a particular
iteration is purely a function of the iteration number and number of
threads. Each thread performs only the iterations assigned to it at the
beginning of the loop.
• In a dynamic schedule, the assignment of iterations to threads can
vary at runtime from one execution to another. Not all iterations are
assigned to threads at the start of the loop. Instead, each thread
requests more iterations after it has completed the work already
assigned to it.
My ParallelFor() that scales very well uses my efficient Threadpool that scales very well, so it uses round-robin scheduling and also work stealing, and I think that this is sufficient.
Read the rest:
My Threadpool engine with priorities that scales very well is really powerful because it scales very well on multicore and NUMA systems. It also comes with a ParallelFor() that scales very well on multicore and NUMA systems.
You can download it from:
https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well
Here is the explanation of my ParallelFor() that scales very well.
This is the method:
procedure ParallelFor(nMin, nMax:integer;aProc: TParallelProc;GrainSize:integer=1;Ptr:pointer=nil;pmode:TParallelMode=pmBlocking;Priority:TPriorities=NORMAL_PRIORITY);
The nMin and nMax parameters of ParallelFor() are the minimum and maximum integer values of the loop variable, the aProc parameter is the procedure to call, and the GrainSize integer parameter is the following:
The grainsize sets a minimum threshold for parallelization.
A rule of thumb is that grainsize iterations should take at least 100,000 clock cycles to execute.
For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:
1- Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations.
If you have no idea of how many clock cycles an iteration might take, start with grainsize=100,000.
The rationale is that each iteration normally requires at least one clock per iteration. In most cases, step 3 will guide you to a much smaller value.
2- Run your algorithm.
3- Iteratively halve the grainsize parameter and see how much the algorithm slows down or speeds up as the value decreases.
A drawback of setting a grainsize too high is that it can reduce parallelism. For example, if the grainsize is 1000 and the loop has 2000 iterations, the ParallelFor() method distributes the loop across only two processors, even if more are available.
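The arithmetic behind the rule of thumb and the drawback above can be sketched as follows. MinGrainSize and ChunkCount are hypothetical helpers, and the splitting rule (consecutive chunks of grainsize iterations) is an assumption of mine for this illustration, not a description of the internals of ParallelFor().

```pascal
program GrainSizeDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

const
  MinCyclesPerChunk = 100000; // rule of thumb from the text

// Smallest grainsize for which one chunk takes at least MinCyclesPerChunk
function MinGrainSize(CyclesPerIteration: Integer): Integer;
begin
  Result := (MinCyclesPerChunk + CyclesPerIteration - 1) div CyclesPerIteration;
end;

// Assumed splitting rule: consecutive chunks of GrainSize iterations
function ChunkCount(NMin, NMax, GrainSize: Integer): Integer;
begin
  Result := (NMax - NMin + GrainSize) div GrainSize;
end;

begin
  // 100 clocks per iteration needs a grainsize of at least 1000
  WriteLn('minimum grainsize: ', MinGrainSize(100));
  // 2000 iterations with grainsize 1000 give only 2 chunks of work
  WriteLn('chunks for 1..2000: ', ChunkCount(1, 2000, 1000));
  // 100000 iterations with grainsize 10000 give 10 chunks
  WriteLn('chunks for 1..100000: ', ChunkCount(1, 100000, 10000));
end.
```

With fewer chunks than processors, some processors have nothing to do, which is the drawback of a grainsize that is too high.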
You can pass a parameter to ParallelFor() as a pointer in Ptr. You can set the pmode parameter to pmBlocking so that ParallelFor() is blocking, or to pmNonBlocking so that ParallelFor() is non-blocking. The Priority parameter is the priority of ParallelFor(). Look inside the test.pas example to see how to use it.
Thank you,
Amine Moulay Ramdane.