Hello Xyce team,
I am experiencing an issue possibly related to the Open MPI implementation.
I have this machine: HP ProLiant DL580 G9
It has 4 processors. Each processor is an Intel Xeon E7-8880 v4.
All four processors are sockets on the same motherboard. They are NOT connected via Ethernet and are not treated as separate cluster nodes.
Each processor has 22 cores with 2 threads per core, so 44 hardware threads (slots) per processor. For simplicity, slots 0-43 are associated with processor 0, slots 44-87 with processor 1, etc.
When I run mpirun with a varying number of slots (4, 8, 16, 32, etc.) and use --cpu-set within a single processor (e.g., 0-43), simulation time scales with the number of slots. However, as soon as the CPU set crosses between two processors, performance collapses: e.g., mpirun -n 4 --cpu-set 42-45 vs. mpirun -n 4 --cpu-set 0-3 differ by 300x in simulation time.
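For reference, this is how I would confirm which CPU IDs actually belong to which socket/NUMA node on this box (a sketch; the exact output is machine-specific, and numactl may need to be installed separately):

```shell
# Summarize socket/NUMA layout; "NUMA node0 CPU(s):" etc. show
# where the 0-43 / 44-87 boundary actually falls on this machine.
lscpu | grep -iE 'socket|numa'

# Per-node CPU lists, memory sizes, and inter-node distances;
# large distances between nodes would be consistent with a
# cross-socket (QPI) penalty.
numactl --hardware
```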
The specific syntax I use is the following:
mpirun -n <num_procs> --use-hwthread-cpus --nooversubscribe --cpu-set <cpu_set> Xyce <filename.spice> -hspice-ext all
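For comparison, here is a binding-aware variant I could try instead of a raw --cpu-set (a sketch; --map-by, --bind-to, and --report-bindings are standard Open MPI 4.x options, and the Xyce arguments are unchanged):

```shell
# Let Open MPI place ranks NUMA-aware and pin each rank to a core;
# --report-bindings prints the actual core/socket each rank landed on,
# which makes any unintended cross-socket placement visible.
mpirun -n 4 --map-by numa --bind-to core --report-bindings \
    Xyce <filename.spice> -hspice-ext all
```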
I ran the MPI regression tests and they didn't show any issues.
The versions of the tools I'm running:
Xyce (parallel) 7.8
Trilinos 12.12.1
Open MPI 4.1.2
GCC 11.4.0
OS: Ubuntu 22.04.3 LTS
Machine:
https://www.hpe.com/psnow/doc/c04601208.html?jumpid=in_lit-psnow-red
Processor: Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
Configuration: 2.5TB memory, 5TB storage
Is this something others have seen?
Thanks,
Eddy