Parallelizing WESTPA + OpenMM on SLURM

anando...@gmail.com

Mar 24, 2025, 10:31:19 AM
to westpa-users
Hello WESTPA community,

I am working on setting up WESTPA simulations with OpenMM on a SLURM-based cluster, and I am looking to parallelize segment propagation across 4 GPUs on a single node. I am using the processes work manager and running OpenMM on one GPU, but I would like to scale up to take advantage of all 4 GPUs.

1. Could anyone with experience in GPU-parallel WESTPA help answer the following? How can I best assign segment propagation to multiple GPUs (e.g., 4) using the processes or zmq work manager? Do I need to modify runseg.sh to set CUDA_VISIBLE_DEVICES per segment? Is there a way to ensure that segments do not all end up on GPU 0?

2. Is w_run --work-manager zmq better suited for this than processes, even on a single node with multiple GPUs? I understand the ZMQ server/client setup, but I am unclear on how to map each client (worker) to a specific GPU when they all run on the same node.

3. Have others encountered significant overhead with OpenMM context creation?
If so, are there known optimizations (e.g., sharing a context between segments in runseg.sh) that reduce launch time?

Example scripts would be greatly appreciated, especially SLURM job scripts (submit.sh) and WESTPA configs (run.sh, runseg.sh) that handle multi-GPU execution cleanly.

I am happy to share my scripts if that would help clarify. I really appreciate any help you can provide. 

Leung, Jeremy

Mar 24, 2025, 12:00:59 PM
to westpa...@googlegroups.com
Hi Anand,


2. ZMQ works in both single-node and multi-node settings. I personally prefer using processes on a single node, even with multiple GPUs (see example), because you no longer have to go through TCP/IP connections (i.e., ssh), which can be slow or lossy in HPC settings. One less hoop to jump through, one less way for jobs to fail. See the links in 1. for more explanation on ZMQ.
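The core of the processes setup on a 4-GPU node is just something like this (rough sketch, not the linked example; adjust the worker count to your GPU count):

# run.sh: one worker per GPU with the processes work manager
w_run --work-manager=processes --n-workers=4 >> west.log 2>&1

# runseg.sh: pin each worker's segment to its own GPU
export CUDA_VISIBLE_DEVICES=$(( WM_PROCESS_INDEX % 4 ))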

3. Startup overhead for OpenMM tends to be short compared to the propagation time, unless you have a really short tau. You could implement a shared context (say, using NVIDIA MPS), and that has demonstrated speedups, but it is not really needed unless you are stretching system limits.
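If you do want to try MPS, on a single node it is roughly just a matter of starting the daemon before w_run (sketch; the same pattern appears in the run.sh example later in this thread):

# start the MPS control daemon if it isn't already running
if ! pgrep -x "nvidia-cuda-mps" > /dev/null ; then
    nvidia-cuda-mps-control -d
fi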

One last thing I'd like to add: in the WESTPA context, running one segment per GPU (and running more segments per bin) is probably more beneficial/faster than trying to get a single segment to run across multiple GPUs, especially now that NVIDIA SLI has been discontinued.

Best,

Jeremy L.

---
Jeremy M. G. Leung, PhD
Postdoctoral Associate, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]


Anupam Anand Ojha

Mar 24, 2025, 3:32:23 PM
to westpa...@googlegroups.com
Thank you, Jeremy, for the reply. Things are much clearer now.



--
Best regards,

A. Anand Ojha
Flatiron Research Fellow
Center for Computational Biology & Center for Computational Mathematics
Flatiron Institute, New York

Victor Montal Blancafort

Apr 21, 2025, 9:01:18 AM
to westpa-users
Dear Jeremy,

Do you have any available example of ZMQ for GPU, though?
I have been trying to use the example here (official recommendations) with GROMACS and, in my experience, it only allocates one of the 4 GPUs available on the node when following the linked tutorial.
Of course, I have tweaked the SBATCH parameters to match the number of GPUs.

The problem is probably a simple error in the srun command, but an example of srun + ZMQ for GPU would be really appreciated! I guess the optimal solution would be for srun to allocate one GPU per task?
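For reference, this is roughly what I have in mind (untested sketch; the SBATCH resources are placeholders for a single 4-GPU node and the w_run flags are the standard ZMQ options):

#!/bin/bash
#SBATCH --job-name=westpa_zmq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:4

source env.sh
cd $WEST_SIM_ROOT
SERVER_INFO=$WEST_SIM_ROOT/west_zmq_info-$SLURM_JOB_ID.json

# ZMQ master: coordinates the run but does no propagation itself
w_run --work-manager=zmq --n-workers=0 --zmq-mode=master \
      --zmq-write-host-info=$SERVER_INFO >> west.log 2>&1 &

# wait (up to a minute) for the master to write its host info
for ((n=0; n<60; n++)); do [ -e $SERVER_INFO ] && break; sleep 1; done

# one client task per node, each running 4 workers (one per GPU);
# runseg.sh then maps $WM_PROCESS_INDEX to a GPU
srun --ntasks=$SLURM_JOB_NUM_NODES --ntasks-per-node=1 \
    w_run --work-manager=zmq --zmq-mode=client --n-workers=4 \
          --zmq-read-host-info=$SERVER_INFO >> west-client.log 2>&1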

I would also like to mention that, when using the "processes" work manager, the GROMACS logs suggest each segment is using all 4 GPUs in parallel (related to point 3 of your answer), which drastically drops performance.

KR,
Victor

Jeremy Leung

Apr 21, 2025, 10:25:28 AM
to westpa-users
Hi Victor,

I don't think we have a combined ZMQ + GMX example, but I would suggest working towards combining features from these two tutorials (ZMQ/Amber/GPU and GMX/CPU):

Looking at the GROMACS documentation, passing the `-gpu_id` flag to mdrun should allow you to specify which GPU to use. I would use the `$WM_PROCESS_INDEX` variable for that.
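Something like this near the top of runseg.sh (untested sketch, assuming 4 GPUs per node):

# pick a GPU for this worker; $WM_PROCESS_INDEX is 0-based within the node
N_GPUS=4
GPU_ID=$(( WM_PROCESS_INDEX % N_GPUS ))

# pass it to mdrun...
gmx mdrun -deffnm seg -nb gpu -gpu_id $GPU_ID

# ...or, alternatively, hide the other GPUs from this worker entirely
# export CUDA_VISIBLE_DEVICES=$GPU_ID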

Best,

Jeremy L.

Hayden Scheiber

Apr 23, 2025, 8:24:56 PM
to westpa-users
Hi all,

I have spent a lot of time running optimized WESTPA simulations using GROMACS on multi-GPU nodes with the processes work manager. The ZMQ work manager is only necessary if you want to scale to multi-node simulations; I have done this before as well, for a proof of concept using 4 A100 nodes with 8 GPUs each. Keep in mind that ZMQ is a distributed-memory paradigm (think MPI) while processes is a shared-memory paradigm (think OpenMP).

Firstly, I highly recommend you read through this blogpost by Alan Gray from NVIDIA. While it is not specifically about WESTPA, it provides a framework on how to best perform many GROMACS simulations in parallel in a multi-GPU environment.

Some key takeaways:
First, you should map workers to specific GPUs using the CUDA_VISIBLE_DEVICES environment variable. I have also tried using the `-gpu_id` flag of `gmx mdrun`; for whatever reason, this causes GROMACS to initialize much more slowly, so stick to the environment variable.

Second, familiarize yourself with the NUMA topology of your node using `nvidia-smi topo -m`. For optimal performance, you want each GROMACS simulation to run on CPUs on the same physical NUMA node as the GPU to which it is assigned. I wrote a Python function that runs `nvidia-smi topo -m` once at the start of the simulation, processes the output, and exports a GPU-to-NUMA map and a CPU-to-NUMA map, which I then use to assign CPUs and a GPU to each worker in runseg.sh.
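My implementation is in Python, but a simplified shell-only version of the same idea (reading the GPU's NUMA node from sysfs instead of parsing `nvidia-smi topo -m`) looks roughly like this:

# look up the NUMA node of the GPU assigned to this worker (here $GPU_IDX)
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader -i $GPU_IDX)
# sysfs wants a lowercase, 4-digit PCI domain: 00000000:17:00.0 -> 0000:17:00.0
SYSFS_ID=$(echo $BUS_ID | tr '[:upper:]' '[:lower:]' | sed 's/^0000//')
NUMA_IDX=$(cat /sys/bus/pci/devices/${SYSFS_ID}/numa_node)
[ "$NUMA_IDX" -lt 0 ] && NUMA_IDX=0   # single-socket systems report -1
# bind the simulation's CPUs and memory to that NUMA node
NUMA_CONFIG="numactl --cpunodebind=$NUMA_IDX --membind=$NUMA_IDX"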

Third, make sure you use the NVIDIA Multi-Process Service (MPS) when running multiple simulations per GPU. It is very simple to use, and you will hinder your performance if you skip it. I also recommend running your simulations with the following CUDA-related environment variables to maximize performance with GROMACS+WESTPA:

export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=80
export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_MPS_CLIENT_PRIORITY=0

Fourth, I found that for systems of about 100-200k atoms, the best throughput is achieved with 6-12 simulations per GPU, depending on the specific GPU.

Below are example run.sh, node.sh, and runseg.sh scripts that implement either processes or ZMQ work manager for multi-GPU simulations for WESTPA+GROMACS.

run.sh

#!/bin/bash

# Make sure that WESTPA is not already running.  Running two instances of
# WESTPA on a single node/machine can cause problems.
# The following line will kill any WESTPA process that is currently running.
if pgrep -x "w_run" >/dev/null; then
    pkill -9 -f w_run
fi
if pgrep -x "w_init" >/dev/null; then
    pkill -9 -f w_init
fi

# Make sure environment is set
source env.sh

# Move to the correct folder and initialize
cd $WEST_SIM_ROOT

if [ "$WM_WORK_MANAGER" = "zmq" ]; then
    # ZMQ server info file
    SERVER_INFO=$WEST_SIM_ROOT/west_zmq_info-$NGC_JOB_ID.json

    # start zmq server
    w_run --work-manager=zmq --n-workers=0 --zmq-mode=master --zmq-write-host-info=$SERVER_INFO --zmq-comm-mode=ipc $WRUN_AUX_ARGS >> west.log 2>&1 &

    # wait on host info file up to one minute
    for ((n=0; n<60; n++)); do
        if [ -e $SERVER_INFO ] ; then
            echo -e "== server info file $SERVER_INFO ==\n"
            cat $SERVER_INFO
            break
        fi
        sleep 1
    done

    # exit if host info file doesn't appear in one minute
    if ! [ -e $SERVER_INFO ] ; then
        echo 'ZMQ server failed to start'
        exit 1
    fi

    # Change the hostname of the master to its true IP such that it can be interpreted by clients (issue with DGX)
    MASTER_IP=`hostname -I | awk '{print $1}'`
    awk -v old="$HOSTNAME" -v new="$MASTER_IP" '{
    gsub(old, new);
    print;
    }' "$SERVER_INFO" > tmp.json && mv tmp.json "$SERVER_INFO"

    # Start clients, with the proper number of workers on each.
    mpirun --map-by node:PE=1 -n $NMPI_NODES $WEST_SIM_ROOT/westpa_scripts/node.sh $WEST_SIM_ROOT --work-manager=zmq --zmq-mode=client --n-workers=$WM_N_WORKERS --zmq-read-host-info=$SERVER_INFO --zmq-comm-mode=ipc $WRUN_AUX_ARGS >> west-$NGC_JOB_ID-mpirun.log 2>&1
else
    # Start MPS daemon on current node if not already running
    if ! pgrep -x "nvidia-cuda-mps" > /dev/null ; then
        nvidia-cuda-mps-control -d
    fi
   
    # Run westpa job on current node, passing given inputs
    w_run --work-manager=$WM_WORK_MANAGER --n-workers=$WM_N_WORKERS >> west.log 2>&1
fi


node.sh (used by run.sh with ZMQ work manager)

#!/bin/bash -l

if [ -n "$NODE_DEBUG" ] ; then
  set -x
  env | sort
fi

# First input should be $WEST_SIM_ROOT, as this variable is not passed to all nodes
cd $1; shift

# Re-source the westpa environment variables
source env.sh

# Report some basic info to the log
echo "starting WEST client processes on: "; hostname
echo "current directory is $PWD"

# Start MPS daemon on current node if not already running
if ! pgrep -x "nvidia-cuda-mps" > /dev/null ; then
  nvidia-cuda-mps-control -d
fi

# Run westpa job on current node, passing given inputs
w_run "$@" >> west.log 2>&1


runseg.sh

#!/bin/bash
#
# runseg.sh
#
# WESTPA runs this script for each trajectory segment. WESTPA supplies
# environment variables that are unique to each segment, such as:
#
#   WEST_CURRENT_SEG_DATA_REF: A path to where the current trajectory segment's
#       data will be stored. This will become "WEST_PARENT_DATA_REF" for any
#       child segments that spawn from this segment
#   WEST_PARENT_DATA_REF: A path to a file or directory containing data for the
#       parent segment.
#   WEST_CURRENT_SEG_INITPOINT_TYPE: Specifies whether this segment is starting
#       anew, or if this segment continues from where another segment left off.
#   WEST_RAND16: A random integer
#
# This script has the following three jobs:
#  1. Create a directory for the current trajectory segment, and set up the
#     directory for running gmx mdrun
#  2. Run the dynamics
#  3. Calculate the progress coordinates and return data to WESTPA

# Start time for the whole script
SCRIPT_START_TIME=$(date +%s.%N)

# Function to calculate and print elapsed time
print_timing() {
    local desc=$1
    local start_time=$2
    local end_time=$(date +%s.%N)
    local elapsed_time=$(echo "$end_time - $start_time" | bc)
    echo "[TIMER] $desc: ${elapsed_time} seconds"
}

# If we are running in debug mode, then output a lot of extra information.
if [ -n "$SEG_DEBUG" ] ; then
  env | sort
fi

######################## Set up for running the dynamics #######################
# Set up the temp directory where data for this segment will be calculated
ITER=$(printf "%06d" $WEST_CURRENT_ITER)
SEG=$(printf "%06d" $WEST_CURRENT_SEG_ID)
CALC_TMPDIR="$WEST_SIM_TMP/traj_segs/$ITER/$SEG"
mkdir -pv $CALC_TMPDIR
cd $CALC_TMPDIR

# The weighted ensemble algorithm requires that dynamics are stochastic.
# We'll use the "sed" command to replace the string "RAND" with a randomly
# generated seed.
sed "s/RAND/$WEST_RAND16/g" $MD_MDP > md.mdp

# Setup GROMACS process and GPU indexes
# WM_PROCESS_INDEX is a 0-based integer identifying the process among the set of processes started on a given node
# It is not defined for all work managers, e.g. serial, but needed for westraj map_worker
if [ -z "$WM_PROCESS_INDEX" ]; then
    export WM_PROCESS_INDEX=0
fi

# This script assigns a GPU_IDX, CPU_IDX, and NUMA_IDX for the current worker
START_TIME=$(date +%s.%N)
eval "$($PYTHON -m westraj.cli.map_worker)"
echo "WORKER_IDX: $WM_PROCESS_INDEX, GPU_IDX: $GPU_IDX, CPU_IDX: $CPU_IDX, NUMA_IDX: $NUMA_IDX, CPU_RANGE: $CPU_RANGE"
print_timing "NUMA Node, GPU, and CPU assignment" $START_TIME

# Assigns the correct GPU/CPUs for this worker
export CUDA_VISIBLE_DEVICES=$GPU_IDX
if [ "$NUMA_AFFINITY_ENABLED" = true ]; then
  NUMA_CONFIG="$NUMACTL --physcpubind=$CPU_RANGE"
else
  NUMA_CONFIG=""
fi
MDRUN_CPU_CONFIG="-ntmpi 1 -nt $OMP_NUM_THREADS -pin on -pinoffset $CPU_IDX -pinstride 1"
echo "MDRUN_CPU_CONFIG: $MDRUN_CPU_CONFIG, NUMA_CONFIG: $NUMA_CONFIG"

# Run the GROMACS preprocessor
START_TIME=$(date +%s.%N)
$NUMA_CONFIG $GMX grompp -f md.mdp -c $REF_GRO -p $TOPOL_TOP \
  -t parent.trr -o seg.tpr -po $NULL -n $INDEX_NDX
print_timing "gmx grompp" $START_TIME

############################## Run the dynamics with re-tries on failure ################################
TRY=0
while true; do
  # Propagate the segment using gmx mdrun
  START_TIME=$(date +%s.%N)
  $NUMA_CONFIG $GMX mdrun $MDRUN_CPU_CONFIG -tunepme no -update gpu -nb gpu -pme gpu \
    -pmefft gpu -bonded cpu -deffnm seg -cpt -1 -nocpnum -cpo $NULL -noconfout
  EXIT_STATUS=$?
  print_timing "gmx mdrun" $START_TIME

  if [ $EXIT_STATUS -eq 0 ]; then
    # GROMACS exited without error, so we can break out of the loop
    break
  fi
  # Archive the crashed simulation files and retry
  echo "mdrun failed with exit status $EXIT_STATUS. Copying simulation files to crash archive."
  ARCHIVE_DIR=$WEST_SIM_ROOT/crashes/$ITER/$SEG/try_$TRY
  mkdir -p $ARCHIVE_DIR
  cp -r $CALC_TMPDIR/* $ARCHIVE_DIR
 
  # If try meets or exceeds the maximum number of retries, then exit
  if [ $TRY -ge $MDRUN_RETRY_MAX ]; then
    echo "max mdrun retry attempts ($MDRUN_RETRY_MAX) reached. Exiting..."
    if [ "$WEST_SIM_TMP" != "$WEST_SIM_ROOT" ]; then
      rm -r $CALC_TMPDIR
    fi
    exit 1
  fi

  # Increment the number of tries and wait before retrying
  TRY=$((TRY + 1))
  echo "Retrying in $MDRUN_RETRY_WAIT seconds... (Attempt $TRY of $MDRUN_RETRY_MAX)"
  sleep $MDRUN_RETRY_WAIT
done

########################## Transform Coordinates ##########################
# 1: Unwrap chains with pbc whole (much faster than MDAnalysis!)
START_TIME=$(date +%s.%N)
mv seg.xtc seg_orig.xtc # prevents overwriting the original xtc file
echo "SOLU" | $NUMA_CONFIG $GMX trjconv -f seg_orig.xtc -s seg.tpr -n $INDEX_NDX -pbc whole -o seg.xtc
print_timing "gmx trajconv to remove pbc artifacts" $START_TIME

########################## Calculate and return data ###########################

# Link the ref.pdb topology file to WEST_TRAJECTORY_RETURN
ln -s $REF_PDB $WEST_TRAJECTORY_RETURN/ref.pdb

# The $CALC_PCOORD script calculates pcoords and auxdata and returns them to westpa
START_TIME=$(date +%s.%N)
$NUMA_CONFIG $CALC_PCOORD
print_timing "Calculated and returned progress coordinates" $START_TIME

# Only pass on the final frame of the trr file to the next iteration to save space
START_TIME=$(date +%s.%N)
$NUMA_CONFIG $GMX trjconv -f seg.trr -o $WEST_RESTART_RETURN/parent.trr -b 1
print_timing "gmx trjconv to truncate trajectory restart file" $START_TIME

# Return the gromacs log to westpa
cp seg.log $WEST_LOG_RETURN

# If everything ran correctly, clean up all the files that we don't need to save.
# But only if $WEST_SIM_TMP is not the same as $WEST_SIM_ROOT.
if [ "$WEST_SIM_TMP" != "$WEST_SIM_ROOT" ]; then
  rm -r $CALC_TMPDIR
fi

# Final script timing
print_timing "Total script execution time" $SCRIPT_START_TIME

Cheers,

Hayden

Victor Montal Blancafort

Apr 24, 2025, 8:38:03 AM
to westpa-users
Thanks, Hayden! Really appreciated!

Some follow-up Qs:
1) I guess you don't have a working script with SLURM/srun, right? I think that, starting from your overall run.sh, I will have to modify the mpirun part. SLURM can "automatically" assign GPUs closest to the CPUs to optimize NUMA affinity; I will take a look.

2) Could you share information on how GROMACS was compiled? Or at least, which binary are you using: gmx or gmx_mpi?

3) I do not completely understand points 3 and 4. So you would NOT recommend sticking to one GPU per segment, then? How many workers per node? In your setup of one node with 8 GPUs, what would be the optimal number? I am trying to figure out how to configure the environment to take as much advantage of our HPC as possible.

Thanks for your time!!
V

Hayden Scheiber

May 1, 2025, 8:51:32 PM
to westpa-users
Hi Victor,

(1) No, I have not used SLURM with WESTPA, sorry. I was using DGX Cloud (and have since moved to AWS Batch as well).

(2) I'm using GROMACS compiled with thread-MPI only, installed in a Docker image. There is no reason to use gmx_mpi unless you plan to run individual instances of GROMACS across multiple nodes (i.e., very large-scale simulations).
In general, it is much more computationally efficient to do the inverse with WESTPA: run multiple instances of GROMACS per node :)

(3) No, I do not recommend using one GPU per segment. To best utilize the GPU(s) and maximize your throughput, you should run multiple segments per GPU! Each individual segment will run more slowly since it has to share the GPU, but your overall throughput will be much higher.
How many workers per node to use will depend on the hardware available on your nodes. Let's take an H100 node as an example: 8 H100 GPUs and 192 CPU cores. From empirical testing, I found that 6 simulations per GPU produces maximal throughput, so I would set up my simulation with 48 total segments running in parallel (6 per GPU), with each segment assigned 4 CPU cores on the same NUMA node as its assigned GPU. With this setup I see consistent 100% GPU utilization and very high CPU utilization across all CPUs.
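In runseg.sh terms, that mapping boils down to something like this (simplified sketch that skips the NUMA restriction handled by my map_worker script):

# 8 GPUs, 6 segments per GPU, 4 cores per segment -> 48 workers total
SEGS_PER_GPU=6
CORES_PER_SEG=4
GPU_IDX=$(( WM_PROCESS_INDEX / SEGS_PER_GPU ))       # workers 0-5 -> GPU 0, 6-11 -> GPU 1, ...
PIN_OFFSET=$(( WM_PROCESS_INDEX * CORES_PER_SEG ))   # non-overlapping 4-core blocks
export CUDA_VISIBLE_DEVICES=$GPU_IDX
MDRUN_CPU_CONFIG="-ntmpi 1 -nt $CORES_PER_SEG -pin on -pinoffset $PIN_OFFSET -pinstride 1"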

Finding the optimal number of simulations per GPU is something you'll have to check empirically on your HPC nodes with timing tests: run a few parallel test cases with varying numbers of simulations per GPU and see which produces the largest total throughput.
The optimal number may vary a bit with system size as well. The aim is to strike a balance: keep the GPUs fully utilized without overwhelming them with so many segments that overhead becomes excessive.
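A standalone timing test outside WESTPA is usually the quickest way to do this. Rough sketch, assuming you already have a representative seg.tpr in the current directory and MPS running; adjust the thread counts and offload flags to your setup:

# launch N concurrent mdrun instances on GPU 0 and compare aggregate ns/day
export CUDA_VISIBLE_DEVICES=0
for N in 2 4 6 8 12; do
    mkdir -p bench_$N && cd bench_$N
    for ((i=0; i<N; i++)); do
        mkdir -p run_$i
        ( cd run_$i && gmx mdrun -s ../../seg.tpr -deffnm seg \
            -ntmpi 1 -nt 4 -nb gpu -pme gpu -update gpu > mdrun.out 2>&1 ) &
    done
    wait
    grep -h "Performance:" run_*/seg.log   # ns/day per instance; sum for total throughput
    cd ..
done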

Here is the Dockerfile I use to install GROMACS. It's built atop the latest PyTorch container from NVIDIA, so it comes pre-built with CUDA and PyTorch (which is useful if you want to do downstream analysis involving neural networks).

############################
# 1) Builder Stage
############################
FROM nvcr.io/nvidia/pytorch:25.03-py3 AS builder

############################
# Build FFTW
############################
ARG FFTW_VERSION=3.3.10
WORKDIR /tmp/fftw
RUN wget --no-check-certificate ftp://ftp.fftw.org/pub/fftw/fftw-${FFTW_VERSION}.tar.gz && \
    tar -xf /tmp/fftw/fftw-${FFTW_VERSION}.tar.gz && \
    cd /tmp/fftw/fftw-${FFTW_VERSION} && \
    CC=gcc CFLAGS='-O3 -pipe' \
    CXX=g++ CXXFLAGS='-O3 -pipe' \
    FFLAGS='-O3 -pipe' \
    LDFLAGS=-Wl,--as-needed \
    ./configure --prefix=/usr/local/fftw --enable-avx --enable-avx2 \
        --enable-float --enable-shared --enable-sse2 --enable-threads && \
    make -j"$(nproc)" && \
    make -j"$(nproc)" install && \
    cd / && rm -rf /tmp/fftw

############################
# Build GROMACS
############################
ARG GMX_VERSION=2025.1
WORKDIR /tmp/gromacs
RUN wget --no-check-certificate ftp://ftp.gromacs.org/gromacs/gromacs-${GMX_VERSION}.tar.gz && \
    tar -xf gromacs-${GMX_VERSION}.tar.gz && \
    cd gromacs-${GMX_VERSION} && mkdir build && cd build && \
    CC=gcc CFLAGS='-O3 -pipe' \
    CXX=g++ CXXFLAGS='-O3 -pipe' \
    FFLAGS='-O3 -pipe' \
    LDFLAGS=-Wl,--as-needed \
    cmake \
      -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs/avx2_256 \
      -DGMX_SIMD=AVX2_256 \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_FLAGS_RELEASE='-O3 -pipe -DNDEBUG' \
      -DCMAKE_CXX_FLAGS_RELEASE='-O3 -pipe -DNDEBUG' \
      -DREGRESSIONTEST_DOWNLOAD=OFF \
      -DBUILD_SHARED_LIBS=ON \
      -DGMX_GPU=CUDA \
      -DGMX_OPENMP=True \
      -DGMX_FFT_LIBRARY=fftw3 \
      -DFFTWF_LIBRARY=/usr/local/fftw/lib/libfftw3f.so \
      -DFFTWF_INCLUDE_DIR=/usr/local/fftw/include \
      -DGMX_BUILD_OWN_FFTW=OFF \
      -DGMX_BUILD_OWN_BLAS=ON \
      -DGMX_BUILD_OWN_LAPACK=ON \
      -DGMX_DOUBLE=OFF \
      -DGMX_X11=OFF \
      -DGMX_THREAD_MPI=ON \
      -DGMXAPI=OFF \
      -DGMX_CUDA_TARGET_SM='80;86;89;90' \
      ../ && \
    cmake --build . --target all -- -j$(nproc) && \
    cmake --build . --target install -- -j$(nproc) && \
    cd / && rm -rf /tmp/gromacs


############################
# 2) Final Stage
############################
FROM nvcr.io/nvidia/pytorch:25.03-py3

# Copy over FFTW & GROMACS from the builder
COPY --from=builder /usr/local/fftw /usr/local/fftw
COPY --from=builder /usr/local/gromacs /usr/local/gromacs

# Ensure FFTW is in the library path
ENV LD_LIBRARY_PATH=/usr/local/fftw/lib:${LD_LIBRARY_PATH}

# Source GROMACS on login
ENV HOME="/root"
RUN echo ". /usr/local/gromacs/avx2_256/bin/GMXRC && cd ~" >> ${HOME}/.bashrc

# System utilities
RUN apt-get update -y && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        libhwloc-dev htop zip poppler-utils ffmpeg sshfs bc \
        curl openssh-client openssh-server && \
    rm -rf /var/lib/apt/lists/*

# unset PIP_CONSTRAINT
ENV PIP_CONSTRAINT=

# Python libraries
RUN echo rm -f /etc/pip/constraint.txt && \
    pip install --no-cache-dir --default-timeout=100 \
    torch-scatter torch-sparse torch-cluster torch-spline-conv \
    torchist accelerate "scipy>1.15" \
    jupyterlab seaborn plotly ipympl nglview biopython \
    MDAnalysis mdtraj awscli boto3 pyedr mpi4py gudhi \
    westpa

# Ensure the working directory is /root
WORKDIR /root

# Default command
CMD ["/bin/bash"]


Cheers,

Hayden 

Victor Montal Blancafort

May 2, 2025, 4:51:57 PM
to westpa-users
Wow, Hayden, I really appreciate your detailed answer!
I will take a look at everything :)

So far, by adapting your scripts and the SLURM default variables, I have managed to make it work on a 4-GPU node with 8 segments (2 per GPU). I will test with more segments per GPU to see the overall performance.

Again, thanks!

V