Dear Paul,
Thanks for your help.
So this is where I am now:
Some of my jobs run correctly, i.e. my program reads a big data file and then performs some operations on that data.
Some of my jobs crash, almost certainly because of the memory configuration.
Details:
What I did then was to slightly reduce all the values I use for the memory configuration, like this:
export GASNET_MAX_SEGSIZE=53000MB  # I tried many values here; it does not seem to be the source of the problem. When in doubt I use 3000MB for this variable.
export UPCXX_SEGMENT_MB=2000
export GASNET_PHYSMEM_MAX=54GB
export GASNET_PHYSMEM_NOPROBE=1
mpirun -n 32 upcxxProgram/upcxxSpmv ../dataset/D90MPI3Dheart.57 100
And everything runs perfectly... but this is unsatisfying, because I am clearly far from using all the available physical memory.
UPCXX_SEGMENT_MB seems to be the source of the problem:
When I use a value of 2000MB or less, everything works fine; anything above that crashes.
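For what it's worth, a sweep like the one below (illustrative values, reusing my command from above) is how one can pin down the threshold:

for seg in 2000 2500 3000 3300; do
    export UPCXX_SEGMENT_MB=$seg
    if mpirun -n 32 upcxxProgram/upcxxSpmv ../dataset/D90MPI3Dheart.57 100; then
        echo "UPCXX_SEGMENT_MB=${seg} OK"
    else
        echo "UPCXX_SEGMENT_MB=${seg} crashed"
    fi
done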
But I am using 16 cores per node, and each node has 64GB of RAM.
Even if I can use "only" 85% of that amount, i.e. 54GB, each thread/process should be able to use around 3300MB of RAM (54000MB / 16 ≈ 3375MB).
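In shell arithmetic, the back-of-the-envelope calculation I have in mind (assuming 1GB = 1000MB, 16 processes per node, and the 85% figure):

RAM_MB=64000                      # 64GB of RAM per node
USABLE_MB=$((RAM_MB * 85 / 100))  # 85% usable -> 54400MB, i.e. ~54GB
PER_PROC_MB=$((USABLE_MB / 16))   # 16 processes per node -> 3400MB each
echo "$PER_PROC_MB MB per process"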
Now, I did my test from an interactive login, using 2 nodes.
So in theory I have 2 x 64GB of RAM available, or 108GB (2 x 54GB) in practice.
With only 2000MB per process, my 32 processes touch about 64GB in total, which is far from 108GB.
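In other words, I would expect a configuration like the one below to fit comfortably in physical memory, yet in practice it crashes, like every value of UPCXX_SEGMENT_MB above 2000 (the 3300 is just my arithmetic above, not a value known to work):

export GASNET_MAX_SEGSIZE=53000MB  # kept as before
export UPCXX_SEGMENT_MB=3300       # the per-process value my arithmetic suggests
export GASNET_PHYSMEM_MAX=54GB     # kept as before
export GASNET_PHYSMEM_NOPROBE=1
mpirun -n 32 upcxxProgram/upcxxSpmv ../dataset/D90MPI3Dheart.57 100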
For info:
The CPUs on the supercomputer I am using have Hyper-Threading capability, but it is supposed to be disabled on the compute nodes.
So I carefully checked that the node from which I launch mpirun -n 32 runs "only" 16 threads (i.e. the other 16 threads must be on the other node, which I cannot "see" from the interactive login).
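For reference, this is the kind of check one can run on the node (standard Linux tools; the process name upcxxSpmv is from my command above):

lscpu | grep -i 'thread(s) per core'  # 1 means Hyper-Threading is off
pgrep -c upcxxSpmv                    # should report 16 processes on this node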
My question:
What is your advice for solving this memory configuration issue?