Thanks, Andreas.
Here are the interactive job requests I tried:
qsub -I -A stf006 -l walltime=00:40:00 -l nodes=1
qsub -I -A stf006 -l walltime=00:40:00 -l nodes=2
qsub -I -A stf006 -l walltime=00:40:00 -l nodes=3
qsub -I -A stf006 -l walltime=00:40:00 -l nodes=4
Here is an example of the environment set-up and the aprun call, run from the work directory:
module load cudatoolkit
module load fftw
module swap PrgEnv-pgi/5.2.82 PrgEnv-gnu
export CRAY_CUDA_PROXY=1
aprun -n 3 /ccs/home/adaa/cp2k-titan-gnu4.9.3/cp2k/exe/titan_gnu_cuda/cp2k.psmp -i H2O-32.inp -o H2O-32.out &
(I also tried various other values for -n, both with and without export OMP_NUM_THREADS=8, as in the CUDA profiling part of the website.)
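The combinations described above can be enumerated with a small dry-run loop. This is only a sketch: the `print_runs` helper is hypothetical, the rank counts passed to it are illustrative rather than the exact set that was run, and it prints the commands instead of executing them.

```shell
#!/bin/sh
# Dry run: print one aprun command per combination instead of executing it.
# The rank counts passed in below are illustrative, not the exact set that was run.
EXE=/ccs/home/adaa/cp2k-titan-gnu4.9.3/cp2k/exe/titan_gnu_cuda/cp2k.psmp

print_runs() {
    for n in "$@"; do
        # Variant without OMP_NUM_THREADS in the environment
        echo "unset OMP_NUM_THREADS; aprun -n $n $EXE -i H2O-32.inp -o H2O-32.out"
        # Variant with OMP_NUM_THREADS=8, as on the CUDA profiling page
        echo "export OMP_NUM_THREADS=8; aprun -n $n $EXE -i H2O-32.inp -o H2O-32.out"
    done
}

print_runs 1 2 3 4
```

Printing the commands first makes it easy to check the list before pasting individual lines into the interactive session.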
The library paths that ldd reports for the executable are identical inside a qsub session on the work node and on the compile node, apart from the hexadecimal load addresses.
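One way to make that comparison repeatable is to strip the addresses before diffing. A minimal sketch, assuming the usual ldd output format in which each line ends with a parenthesised hexadecimal address; the `strip_addrs` helper and the output file names are hypothetical:

```shell
#!/bin/sh
# Remove the trailing "(0x...)" load address that ldd appends to each line,
# then sort so two listings can be compared line by line.
strip_addrs() {
    sed 's/ *(0x[0-9a-fA-F]*)$//' | sort
}

# Usage (file names are hypothetical): run once in each environment, then diff.
#   ldd cp2k.psmp | strip_addrs > ldd_compile.txt   # on the compile node
#   ldd cp2k.psmp | strip_addrs > ldd_work.txt      # inside the qsub session
#   diff ldd_compile.txt ldd_work.txt && echo "identical"
```

With the addresses removed, an empty diff means the same libraries resolve to the same paths in both environments.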
By the way, I also built 4.1 on Eos (also a Cray, but with no GPUs) as psmp without CUDA, and it runs on multiple nodes without problems.
Thanks again,
Ada