Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

More of my philosophy about mathematics and about C++ and more of my thoughts..

3 views

Skip to first unread message

Amine Moulay Ramdane

unread,

Nov 24, 2022, 11:38:41 AM11/24/22

Hello,

More of my philosophy about mathematics and about C++ and more of my thoughts..

I am a white arab, and i think i am smart since i have also
invented many scalable algorithms and algorithms..

I also program in Delphi and Freepascal, but i have also programmed in
C++, and here is my open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well , read about it in my following thoughts:

The matrix-vector multiplication of large matrices is completely limited by the memory bandwidth, so vector extensions like using SSE or AVX are usually not necessary for matrix-vector multiplication of large matrices. It is interesting that matrix-matrix-multiplications don't have these kind of problems with memory bandwitdh. Companies like Intel or AMD typically usually show benchmarks of matrix-matrix multiplications and they show how nice they scale on many more cores, but they never show matrix-vector multiplications, and notice that my Powerful Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well is also memory-bound and the matrices for it are usually big, but my new algorithm of it is efficiently cache-aware and efficiently NUMA-aware, and i have implemented it for the dense and sparse matrices, and you can read about it in my following thoughts:

More of my philosophy about the 12 memory channels of
the new AMD Epyc Genoa CPU and more of my thoughts..

So as i am saying below, i think that so that to use 12 memory
channels in parallel that supports it the new AMD Genoa CPU, the GMI-Wide mode must enlarge more and connects each CCD with more GMI links, so i think that it is what is doing AMD in its new 4 CCDs configuration, even with the costs optimized Epyc Genoa 9124 16 cores with 64 MB of L3 cache with 4 Core Complex Dies (CCDs), that costs around $1000 (Look at it here: https://www.tomshardware.com/reviews/amd-4th-gen-epyc-genoa-9654-9554-and-9374f-review-96-cores-zen-4-and-5nm-disrupt-the-data-center ), and as i am explaining more below that the Core Complex Dies (CCDs) connect to memory, I/O, and each other through the I/O Die (IOD) and each CCD connects to the IOD via a dedicated high-speed, or Global Memory Interconnect (GMI) link and the IOD also contains memory channels, PCIe Gen5 lanes, and Infinity Fabric links and all dies, or chiplets, interconnect with each other via AMD’s Infinity Fabric Technology, and of course this will permit my new software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well to scale on the 12 memory channels, read my following thoughts so that to understand more about it:

More of my philosophy about the new Zen 4 AMD Ryzen™ 9 7950X and more of my thoughts..

So i have just looked at the new Zen 4 AMD Ryzen™ 9 7950X CPU, and i invite you to look at it here:

https://www.amd.com/en/products/cpu/amd-ryzen-9-7950x

But notice carefully that the problem is with the number of supported memory channels, since it just support two memory channels, so it is not good, since for example my following Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well is scaling around 8X on my 16 cores Intel Xeon with 2 NUMA nodes and with 8 memory channels, but it will not scale correctly on the
new Zen 4 AMD Ryzen™ 9 7950X CPU with just 2 memory channels since it is also memory-bound, and here is my Powerful Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well and i invite you to take carefully a look at it:

https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

So i advice you to buy an AMD Epyc CPU or an Intel Xeon CPU that supports 8 memory channels.

---

And of course you can use the next Twelve DDR5 Memory Channels for Zen 4 AMD EPYC CPUs so that to scale more my above algorithm, and read about it here:

https://www.tomshardware.com/news/amd-confirms-12-ddr5-memory-channels-on-genoa

And here is the simulation program that uses the probabilistic mechanism that i have talked about and that prove to you that my algorithm of my Parallel C++ Conjugate Gradient Linear System Solver Library is scalable:

If you look at my scalable parallel algorithm, it is dividing the each array of the matrix by 250 elements, and if you look carefully i am using two functions that consumes the greater part of all the CPU, it is the atsub() and asub(), and inside those functions i am using a probabilistic mechanism so that to render my algorithm scalable on NUMA architecture , and it also make it scale on the memory channels, what i am doing is scrambling the array parts using a probabilistic function and what i have noticed that this probabilistic mechanism is very efficient, to prove to you what i am saying , please look at the following simulation that i have done using a variable that contains the number of NUMA nodes, and what i have noticed that my simulation is giving almost a perfect scalability on NUMA architecture, for example let us give to the "NUMA_nodes" variable a value of 4, and to our array a value of 250, the simulation bellow will give a number of contention points of a quarter of the array, so if i am using 16 cores , in the worst case it will scale 4X throughput on NUMA architecture, because since i am using an array of 250 and there is a quarter of the array of contention points , so from the Amdahl's law this will give a scalability of almost 4X throughput on four NUMA nodes, and this will give almost a perfect scalability on more and more NUMA nodes, so my parallel algorithm is scalable on NUMA architecture and it also scale well on the memory channels,

Here is the simulation that i have done, please run it and you will notice yourself that my parallel algorithm is scalable on NUMA architecture.

Here it is:

---
program test;

uses math;

var tab,tab1,tab2,tab3:array of integer;
a,n1,k,i,n2,tmp,j,numa_nodes:integer;
begin

a:=250;
Numa_nodes:=4;

setlength(tab2,a);

for i:=0 to a-1
do
begin

tab2:=i mod numa_nodes;

end;

setlength(tab,a);

randomize;

for k:=0 to a-1
do tab:=k;

n2:=a-1;

for k:=0 to a-1
do
begin
n1:=random(n2);
tmp:=tab;
tab:=tab[n1];
tab[n1]:=tmp;
end;

setlength(tab1,a);

randomize;

for k:=0 to a-1
do tab1:=k;

n2:=a-1;

for k:=0 to a-1
do
begin
n1:=random(n2);
tmp:=tab1;
tab1:=tab1[n1];
tab1[n1]:=tmp;
end;

for i:=0 to a-1
do
if tab2[tab]=tab2[tab1] then
begin
inc(j);
writeln('A contention at: ',i);

end;

writeln('Number of contention points: ',j);
setlength(tab,0);
setlength(tab1,0);
setlength(tab2,0);
end.
---

More of my philosophy about 4 CCDs configuration of AMD Epyc Genoa CPU and more of my thoughts..

I have just read the following new paper about AMD 4th Gen EPYC 9004 Series, so i invite you to read it carefully:

https://hothardware.com/reviews/amd-genoa-data-center-cpu-launch

So read carefully the 4 CCDs configuration, so i am understanding
the following from it:

I/O DIE is what is connected to the memory channels externally, and it says that SKUs north of 4 CCDs (e.g. 32 cores) use the GMI3-Narrow configuration with a single GMI link per CCD. With 4 CCD and lower SKUs, AMD can implement GMI-Wide mode which joins each CCD to the IOD with two GMI links. In this case, one link of each CCD populates GMI0 to GMI3 while the other link of each CCD populates GMI8 to GMI11 as diagramed above. This helps these parts better balance against I/O demands.

So i think that that AMD implemented in his new 4 CCDs configuration the GMI-Wide mode which joins each CCD to the IOD with two GMI links, so that to be connected to the 8 memory channels externally and use them in parallel, so i think that the problem is solved, since i think that the cost optimized Epyc Genoa 9124 16 cores with 64 MB of L3 cache with 4 Core Complex Dies (CCDs), that costs around $1000 (Look at it here: https://www.tomshardware.com/reviews/amd-4th-gen-epyc-genoa-9654-9554-and-9374f-review-96-cores-zen-4-and-5nm-disrupt-the-data-center )
can use fully the 8 memory channels in parallel, so it is a good Epyc Genoa processor to buy.

And of course i invite you to read the following:

More of my philosophy about the new Epyc Genoa and about Core Complex Die (CCD) and Core-complex(CCX) and more of my thoughts..

I have just looked at the following paper from AMD and i invite
you to look at it:

https://developer.amd.com/wp-content/resources/56827-1-0.pdf

And as you notice above that you have to look at how many
Core Complex Dies (CCDs) you have, since it tells you more
about how many connections of Infinity Fabric you have, and it is
an important information, since look at the following article
about the new AMD Epyc Genoa:

https://wccftech.com/amd-epyc-genoa-cpu-lineup-specs-benchmarks-leak-up-to-2-6x-faster-than-intel-xeon/

And you can read much more of my thoughts about technology in the following web links:

https://groups.google.com/g/alt.culture.morocco/c/MosH5fY4g_Y

And here:

https://groups.google.com/g/soc.culture.usa/c/N_UxX3OECX4

More of my philosophy about my PERT++ and about my JNI Wrapper for Delphi and FreePascal and about the new Java SE Development Kit 19.0.1 and more of my thoughts..

I am a white arab, and i think i am smart since i have also
invented many scalable algorithms and algorithms..

I have just downloaded and installed the new Java SE Development Kit 19.0.1

Here it is:

https://www.oracle.com/java/technologies/javase/jdk19-archive-downloads.html

And i have just tested my open source JNI Wrapper for Delphi and FreePascal with the Java SE Development Kit 19.0.1, and it is working perfectly, so you can download my JNI Wrapper for Delphi and FreePascal from my website here:

https://sites.google.com/site/scalable68/jni-wrapper-for-delphi-and-freepascal

And I have also tested my other open source software project called PERT++ with the new Java SE Development Kit 19.0.1, and it is working perfectly, so you can download my PERT++ from my website here:

https://sites.google.com/site/scalable68/pert-an-enhanced-edition-of-the-program-or-project-evaluation-and-review-technique-that-includes-statistical-pert-in-delphi-and-freepascal

And I have in my PERT++ provided you with two ways of how to estimate the critical path, first, by the way of CPM(Critical Path Method) that shows all the arcs of the estimate of the critical path, and the second way is by the way of the central limit theorem by using the inverse normal distribution function, and you have to provide my software project that is called PERT++ with three types of estimates that are the following:

Optimistic time - generally the shortest time in which the activity
can be completed. It is common practice to specify optimistic times
to be three standard deviations from the mean so that there is
approximately a 1% chance that the activity will be completed within
the optimistic time.

Most likely time - the completion time having the highest
probability. Note that this time is different from the expected time.

Pessimistic time - the longest time that an activity might require. Three standard deviations from the mean is commonly used for the pessimistic time.

The central limit theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough.

How large is "large enough"?

In practice, some statisticians say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger. So i invite you to read my following thoughts about my software
project that is called PERT++, and notice that the PERT networks are referred to by some researchers as "probabilistic activity networks" (PAN) because the duration of some or all of the arcs are independent random variables with known probability distribution functions, and have finite ranges. So PERT uses the central limit theorem (CLT) to find the expected project duration.

So I have provided you in my PERT++ with the following functions:

function NormalDistA (const Mean, StdDev, AVal, BVal: Extended): Single;

function NormalDistP (const Mean, StdDev, AVal: Extended): Single;

function InvNormalDist(const Mean, StdDev, PVal: Extended; const Less: Boolean): Extended;

For NormalDistA() or NormalDistP(), you pass the best estimate of completion time to Mean, and you pass the critical path standard deviation to StdDev, and you will get the probability of the value Aval or the probability between the values of Aval and Bval.

For InvNormalDist(), you pass the best estimate of completion time to Mean, and you pass the critical path standard deviation to StdDev, and you will get the length of the critical path of the probability PVal, and when Less is TRUE, you will obtain a cumulative distribution.

So as you are noticing from my above thoughts that since PERT networks are referred to by some researchers as "probabilistic activity networks" (PAN) because the duration of some or all of the arcs are independent random variables with known probability distribution functions, and have finite ranges. So PERT uses the central limit theorem (CLT) to find the expected project duration. So then you have to use my above functions
that are Normal distribution and inverse normal distribution functions, please look at my demo inside my zip file to understand better how i am doing it:

You can download and read about my PERT++ from my website here:

https://sites.google.com/site/scalable68/pert-an-enhanced-edition-of-the-program-or-project-evaluation-and-review-technique-that-includes-statistical-pert-in-delphi-and-freepascal

Thank you,
Amine Moulay Ramdane.

0 new messages