ADDA on windows

Michel GROSS

unread,
May 30, 2024, 4:22:49 PM5/30/24
to ADDA questions and answers

Dear Maxim,

Very happy to receive your response. Of course, I am entirely ready to continue the discussion. As with many software programs available on GitHub, the authors often use Linux and GNU tools along with command-line compilations. Some authors, whom you know well, go further and only code in Fortran under Linux and refuse to do anything under Windows.

As a user, the Linux command-line approach is inconvenient. Everything is based on a script that is completely incomprehensible except to the person who wrote it. If everything works, that's fine, but the novice that I am learns nothing and therefore doesn't understand why it worked. But very often, a detail causes the project construction to fail, and since the novice has learned nothing, he doesn't know how to fix it. Moreover, as an occasional programmer, I use Visual Studio with the Microsoft compiler, which, for me, is by far the most powerful tool. That's why I found it simpler, more reliable, and especially more instructive to compile ADDA from scratch with Visual Studio. With the help of ChatGPT, doing that is now possible.

Having the code in an IDE like Visual Studio 2019 allows you to execute it step by step, examine the variables, and thus understand how the software works and what it does. If the goal were simply to produce a functional executable, there would be easier ways to obtain one; but you would learn nothing along the way.

During my initial attempts with the Microsoft MSVC compiler, I encountered a multitude of errors because the C99 option is not available in MSVC. Upon closer inspection, the problem seems to stem from the complex arithmetic operators (+, -, *, /), which are not supported for the complex type defined in <complex.h>. However, the C++ type std::complex<double> defined in <complex> should be usable. Fortunately, the documentation indicates that the Intel C compiler with C99 can work. The Intel compiler does work, but the Visual Studio tooling is then much less powerful.

To illustrate the power of Visual Studio 2019, let's take the example of the problem posed by "_isatty". In which header is "_isatty" declared? Impossible to know if the documentation is not up to date. To find out, I selected the MSVC compiler in Visual Studio 2019 (even though it generates multiple errors), right-clicked on the "_isatty" identifier, and in the drop-down menu that appeared, asked to display the declaration of the "_isatty" function. Visual Studio 2019 opened the "corecrt_io.h" file and displayed the declaration of "_isatty". Thus, I knew that "_isatty" was declared in the <corecrt_io.h> header.

I did the same to find the header where UINT32_MAX was declared.

Regarding the export of functions in a DLL, I asked ChatGPT how to export subroutine BJNDD(n, x, bj, dj, fj) and subroutine propaespacelibreintadda(Rij, k0a,). ChatGPT indicated that I needed to add !DEC$ ATTRIBUTES DLLEXPORT :: BJNDD and !DEC$ ATTRIBUTES DLLEXPORT :: propaespacelibreintadda.

There is a very useful tool under Windows: dllexp.exe (https://www.nirsoft.net/utils/dll_export_viewer.html), which lists the functions exported by a Windows DLL. It shows that a DLL created by IFORT or IFX exports only a very small number of functions, with capitalized names, whereas mingw64 with gfortran exports all functions, keeps the names lowercase, and appends an underscore.

Regarding oclkernels.cl, I saw the error but didn’t understand it. I asked ChatGPT for help. ChatGPT wrote the "readKernelSource" function and modified the code. I saw that it worked, and I didn't investigate further.

Finally, putting code on GitHub is good, but offering codes with project files for IDEs like Visual Studio would be very useful to users. Only step-by-step execution in an IDE allows for understanding.

Look at what Micromanager does.

Best regards,

Michel GROSS

Maxim Yurkin

unread,
Jun 3, 2024, 6:14:12 AM6/3/24
to adda-d...@googlegroups.com
Dear Michel,

First, I fully agree that using an IDE gives a lot of benefits. That's why several ADDA developers, including myself, have been using Eclipse for many years (it allows everything that you mentioned with regard to Visual Studio). This is described at https://github.com/adda-team/adda/wiki/UsingEclipse , but has not been mentioned in many other places, since I always believed that most ADDA users want to just compile and run ADDA, or (even better) just run it. Now I have mentioned IDEs in a few more places, and referred to your guide for Microsoft VS, but it is still on the main ADDA page.

I have also updated the ADDA code to solve two issues that you uncovered (the problem with reading kernel sources at runtime - https://github.com/adda-team/adda/issues/332 - and the inclusion of <stdint.h> in oclcore.c). The latter is good practice (for full portability), but <stdint.h> was already effectively included by <CL/cl.h> through the intermediate cl_platform.h. It is kind of strange that VS did not find it.

Concerning _isatty - I still believe that the currently included headers are sufficient. Even if VS complains about it, please try compiling as is. The Intel compiler should be OK with it (as it strictly conforms to the C99 standard).

Finally, the most complicated part is the compilation of the Fortran sources into DLLs. While your simple solution works, there are several reasons why I am reluctant to make it the default for ADDA:
1) DLLs are OS-specific and seem to be overkill for this specific task (here we just need to build a single ADDA executable, not to provide a shared library to be used by many programs). The currently used linking of .obj files also works on Windows, even when using the Intel compiler in combination with VS build tools (as described in iw_compile.bat). However, I agree that it may not be easy to set it up inside a Visual Studio project.
2) Although the proposed additions to the Fortran files are comments, they may still cause compiler warnings:
https://fortran-lang.discourse.group/t/standard-conforming-way-of-doing-dec-attributes-dllexport/6523 .
3) I prefer not to touch the legacy Fortran files, since once you do, you need to take some responsibility for them (as for the C code now). And this requires one to understand different Fortran standards and specifically aim for the one used in the source files (which is far beyond my current knowledge). Otherwise, there is a danger that issues will appear when somebody compiles the code on some system that we have never tested. These issues can potentially be more serious than the above-mentioned compiler warnings.
4) As an example, aiming at a particular standard may involve a significant amount of effort - see
https://community.intel.com/t5/Intel-Fortran-Compiler/my-Fortran-DLLs-for-my-students/m-p/1453541/highlight/true#M164909
. This is something that I am not ready to do.

To conclude, I ask you to try to compile the current ADDA source, using:

ifx for Fortran sources (as is);

icx for C sources with the following preprocessor definitions (in addition to some VS-specific ones):
NO_GITHASH
propaespacelibreintadda_=PROPAESPACELIBREINTADDA
bjndd_=BJNDD
and, for the ocl mode, additionally:
OPENCL
OCL_READ_SOURCE_RUNTIME

and then link them together using icx.

However, if that is too complicated (in a VS project setup), you may surely keep the DLL approach for Fortran.

The above does not discuss optimization-related compilation flags (to get maximum performance), but omitting them is actually good if you want to debug the code. If you are interested in those, please look inside iw_compile.bat.

Maxim.


Michel GROSS

unread,
Nov 17, 2025, 5:25:25 AM (12 days ago) Nov 17
to ADDA questions and answers
Dear Maxim,

As you probably know, I worked with Patrick Chaumet to help him use the GPU and CUDA in the Fortran IFDDA code. On my computer, equipped with a Titan V GPU, I managed to achieve a 7× speed-up in computation time compared to Patrick's big machine with many cores. After acquiring a more powerful GPU, Patrick was able to obtain an even larger improvement (10× or more). You know all of this, since you are a co-author of the paper describing these results.

I tried to do similar work with ADDA, and I now have a version of ADDA that runs somewhat faster on my computer (a factor of 2 to 2.5) than the corresponding ADDA OpenCL code. I carried out this adaptation without fully understanding the subtleties of ADDA's code, and therefore kept ADDA's internal logic, particularly the 3-step 3D FFT (FFT along X, transposition, then FFT along Y and Z). CUDA gives a small improvement over OpenCL (about 20%) for the computations performed in matvec, but the main speed-up came from running all computations of the BiCGStab iteration on the GPU. For now, I have only converted the BiCGStab code, but converting the other iterative methods can be done easily and quickly.

The transfer back to the CPU is done within the iterative solver, and the results are copied at the end of the solver with:

nMult_mat_ptr(&pvec, &xvec, cc_sqrt, &material, local_nvoid_Ndip);
copy_gpu_to_cpu_doublecomplex(&pvec);

I thus avoid the CPU–GPU copies performed at the beginning and end of matvec, as well as the CPU computations of BICGSTAB, which account for the majority of the computation time in the OpenCL version of ADDA.

From a technical standpoint, this work is rather challenging because it involved modifying code that I do not fully understand, without breaking it. To achieve this, I decided to carry out the conversion using Visual Studio 2019 or 2022 (and not Visual Studio Code), relying on Microsoft’s tools (compiler, IntelliSense, find/replace, debugger, etc.), because from experience I know these tools are better suited for this kind of work.

ADDA uses C99 for doublecomplex, but C99 complex types are not supported by the Microsoft compiler. I tried using the LLVM (clang) compiler within Visual Studio, but the other tools (IntelliSense, find/replace, debugger, etc.) then worked less effectively. I therefore started from ADDA version 1.2, which does not use the C99 doublecomplex. The code is a bit longer and the equations slightly heavier, but AI is able to convert everything to CUDA without difficulty.

To be able to perform computations in parallel on both the CPU and the GPU, I created a type…


typedef struct {
	doublecomplex *cpu;
	cuDoubleComplex *gpu;
	int nb;
	...
} doublecomplex_ptr;

replacing all doublecomplex* arrays with doublecomplex_ptr.

For example, in vars.c I have:

doublecomplex_ptr xvec; // total electric field on the dipoles
doublecomplex_ptr pvec; // polarization of dipoles, also an auxiliary vector in iterative solvers
doublecomplex_ptr Einc; // incident field on dipoles

and in calculator.c I have:

MALLOC_VECTOR(xvec.cpu,complex,local_nRows,ALL); xvec.nb = local_nRows;
MALLOC_VECTOR(rvec.cpu,complex,local_nRows,ALL); rvec.nb = local_nRows;
MALLOC_VECTOR(pvec.cpu,complex,local_nRows,ALL); pvec.nb = local_nRows;
MALLOC_VECTOR(Einc.cpu,complex,local_nRows,ALL); Einc.nb = local_nRows;
MALLOC_VECTOR(Avecbuffer.cpu,complex,local_nRows,ALL); Avecbuffer.nb = local_nRows;

Here xvec.nb = local_nRows; records the number of elements of the array in doublecomplex_ptr xvec. This information is used later to allocate the GPU pointer and to copy the CPU data to the GPU.

I then created small functions that perform the computations without relying on global variables.

The following code is extracted from matvec.c:


for (i=0;i<local_nvoid_Ndip;i++) {
	// fill grid with argvec*sqrt_cc
	j=3*i;
	mat=material.cpu[i];

	//index=IndexXmatrix(position[j],position[j+1],position[j+2]);
	// positions
	size_t x = (size_t)position.cpu[j];
	size_t y = (size_t)position.cpu[j + 1];
	size_t z = (size_t)position.cpu[j + 2];

	// inline index
	size_t index = (z * smallY + y) * gridX + x;

	// Xmat=cc_sqrt*argvec
	for (Xcomp=0;Xcomp<3;Xcomp++)
		cMult(cc_sqrt[mat][Xcomp],argvec.cpu[j+Xcomp],Xmatrix.cpu[index+Xcomp*local_Nsmall]);
}


It uses many global variables. Guided by the compiler errors, I replaced it with a call to the function fill_Xmatrix, which performs the same operations as the lines above.

void fill_Xmatrix(doublecomplex_ptr *Xmatrix_,
	doublecomplex_ptr *argvec_,
	doublecomplex cc_sqrt_[MAX_NMAT][3],
	uchar_ptr material_,
	ushort_ptr position_,
	int local_nvoid_Ndip,
	int local_Nsmall,
	int gridX,
	int smallY)
{
	if (Xmatrix_->cpu != NULL && argvec_->cpu != NULL) {
		for (int i = 0; i < local_nvoid_Ndip; i++) {
			// fill grid with argvec*sqrt_cc
			int j = 3 * i;
			unsigned char mat = material_.cpu[i];

			//index=IndexXmatrix(position[j],position[j+1],position[j+2]);
			// positions
			size_t x = (size_t)position_.cpu[j];
			size_t y = (size_t)position_.cpu[j + 1];
			size_t z = (size_t)position_.cpu[j + 2];

			// inline index
			size_t index = (z * smallY + y) * gridX + x;

			// Xmat=cc_sqrt*argvec
			for (int Xcomp = 0; Xcomp < 3; Xcomp++)
				cMult(cc_sqrt_[mat][Xcomp], argvec_->cpu[j + Xcomp], Xmatrix_->cpu[index + Xcomp * local_Nsmall]);
		}
	}
}

The parameters of the fill_Xmatrix function provide all the necessary information, and since the code is fairly short, the AI can clean up the function perfectly, use another complex type if needed—such as the complex numbers from cuComplex.h—and convert it to CUDA.

In the same way, I converted the functions that perform the computations in iterative.c, such as…


void nIncrem10_cmplx(doublecomplex *a, doublecomplex *b, const doublecomplex c,
	double *inprod, TIME_TYPE *comm_timing)

with functions that perform the same calculation, such as

void nIncrem01_ptr(doublecomplex_ptr *aa,
	doublecomplex_ptr *bb,
	const COMPLEX2 c,
	DOUBLE2 *inprod,
	TIME_TYPE *comm_timing,
	int local_nRows);


Here, doublecomplex c has been replaced by COMPLEX2 c, and double *inprod by DOUBLE2 *inprod, with:

typedef struct {
	cuDoubleComplex cpu, gpu;
} COMPLEX2;

typedef struct {
	double cpu, gpu;
} DOUBLE2;

The function nIncrem01_ptr can perform the computations in parallel on both the CPU and the GPU by using the CPU and GPU coefficients. This makes it possible to carry out all computations simultaneously on the CPU and GPU, and therefore to verify that everything is correct.



While converting the code, I encountered a few difficulties related to C and to the choice of complex number definitions. In ADDA, the doublecomplex type is an array of two doubles:

typedef double doublecomplex[2];

This means that doublecomplex variables decay to plain pointers to double, and that C's rather lax type checking does not catch all errors, especially in function calls. To address this, I renamed matvec.c and iterative.c to matvec.cpp and iterative.cpp, so that I could use the C++ compiler, which is much stricter and therefore safer. In order to mix C and C++ within the project, all functions and all global variables were declared extern "C", by placing at the beginning of each header:


#ifdef __cplusplus
extern "C" {
#endif

....

#ifdef __cplusplus
}
#endif

By using this method and relying on AI, the conversion can be done without too much difficulty.

What should be done now?

Of course, I will convert the other iterative solvers to CUDA, such as

ITER_FUNC(BCGS2);
ITER_FUNC(BiCG_CS);
ITER_FUNC(BiCGStab);
ITER_FUNC(CGNR);
ITER_FUNC(CSYM);
ITER_FUNC(QMR_CS);
ITER_FUNC(QMR_CS_2);

and this can be done without any difficulty.

AI should also make it possible to convert the CUDA code back into OpenCL code without trouble, with the benefit of running the entire iteration on the GPU.

However, the computations done with ADDA remain much slower than the equivalent computations in IFDDA, most likely because the 3D FFTs are performed in three steps, with slicing followed by the Transpose_YZ operation, which is very inefficient on the GPU because the memory addresses are not contiguous. For computations without MPI, using a 3D FFT optimized by cuFFT would undoubtedly be much faster. I therefore plan to replace the 3-step 3D FFT computations in ADDA with direct 3D FFTs implemented with FFTW and cuFFT. As for MPI computations, NVIDIA has, I believe, provided libraries that would improve things, but I do not know more about them, and I understand that the design choices made in ADDA were guided by MPI constraints.

For the direct 3D FFT computations, I may need a bit of assistance. The simplest approach is probably to implement the same 3D FFT for both matvec and green.

Michel Gross

Maxim Yurkin

unread,
Nov 18, 2025, 12:09:03 PM (11 days ago) Nov 18
to adda-d...@googlegroups.com
Michel, thanks a lot for all your work. That is really impressive!

I agree that making it work for all iterative solvers would be great, as well as testing the 3D FFT. With respect to the latter, the paper that we are preparing seems to show that 1D FFT is clearly beneficial on the CPU, but indeed I do not know what happens on the GPU.

It's a pity that you had to downgrade to ADDA 1.2 to implement the CUDA code; still, your code is definitely useful and should be used (at least in parts) for future ADDA development. Therefore, could you please put your code on GitHub? For that, first fork the adda repository to your account, then create a branch based on version 1.2, then replace the modified files and push these changes to GitHub. This will make your contribution "official" and greatly ease any further merges of the code.

Then, it would also be easy to test the code using some scripts, for example the one that Clement Argentin developed for comparing different DDA codes (it will be online soon). I am specifically interested in a comparison of different GPU modes - not just one, but many of them, which helps to make more reliable conclusions about the main acceleration factors and bottlenecks. This is a continuation of our previous discussion - https://groups.google.com/g/adda-discuss/c/7QJgukAv_Ho - but see also the attached figure (for two Nvidia GPUs available at the CRIANN cluster). It would be interesting to add ADDA-CUDA to this graph. Comparing to the IFDDA timing on the same GPU, we can also answer the 1D vs. 3D FFT question (since that is the main algorithmic difference between the codes).

Maxim.


Appendix_Figure1.png

Michel GROSS

unread,
Nov 18, 2025, 4:12:19 PM (11 days ago) Nov 18
to adda-d...@googlegroups.com

Dear Maxim,

The choice of ADDA 1.2 is not important, because my goal was simply to work with appropriate tools without wasting time on technical details. Once the ADDA 1.2 version is stable, it will be possible to adapt the latest version of ADDA in the same way. I believe everything should be compiled in C++, without relying on C99 complex types. C++ is essential for working safely, and it provides alternative complex number types that make the equations much easier to read. CUDA, in any case, uses its own complex types such as cufftComplex or cuDoubleComplex. But these are minor details.

Now that everything is working on my side, I am beginning to better understand the internal logic of ADDA. I think it is possible to significantly accelerate GPU computation (typically by a factor of 10). For FFTs of size 256³, the transposition step takes about 20 ms, whereas a single FFT along x, y, or z takes only 1 ms, and a full 3D FFT takes 3 ms. So the transposition is extremely expensive. The other very costly operation, probably as expensive as the transposition, is slicing along the fast axis x. This choice was dictated by MPI constraints, but it makes GPU computation very inefficient.

It is possible to go much faster either by performing full 3D FFTs (which requires more GPU memory), or by choosing other axes for zero-padding and slicing, thereby eliminating all transpositions. The z-axis is the least expensive (zero-padding on z, FFT on z, trivial slicing on z, then 2D FFT on yz), and the y-axis is almost as good while remaining compatible with MPI.

This is where I currently stand. I now want to be able to change the choice of FFT axes by modifying Matvec and the Green function computation, and verify that the results remain identical. Only after that will I introduce CUDA and measure performance.

Michel


