Dan,
Thanks for the feedback! Sounds like this may not go far.
Here's another angle: before I knew about CUDA_VISIBLE_DEVICES, when I ran training, it ran multiple jobs on one GPU, and I was limited by GPU memory (the GPUs were in the default compute mode, which allows multiple processes to share a GPU). I could run 3 jobs at once within 8 GB of memory. With one job per GPU in parallel, I can now do 4, and of course 4 > 3.
Is there a configuration that allows running both "wide" (across multiple GPUs) and "deep" (multiple processes per GPU), so you could run, say, 12 jobs at once, 3 per GPU? I tried setting CUDA_VISIBLE_DEVICES with the default compute mode and got an out-of-memory error; it looked like the jobs never knew to spill over to the next GPU.
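For what it's worth, here's a sketch of how I imagine the "wide and deep" launch could work: pin each job to one physical GPU via CUDA_VISIBLE_DEVICES, and let the default compute mode handle the 3 jobs sharing that GPU. The `echo` is a placeholder for the real training command, and the 4-GPU / 3-job counts are just my setup:

```shell
#!/bin/sh
# Hypothetical launcher: 4 GPUs x 3 jobs each = 12 jobs total.
# CUDA_VISIBLE_DEVICES restricts each job to one physical GPU,
# so jobs never try to spill onto a neighboring GPU; the default
# compute mode then lets the 3 jobs on each GPU share its memory.
for gpu in 0 1 2 3; do
  for slot in 1 2 3; do
    # Placeholder: substitute the actual training command here.
    CUDA_VISIBLE_DEVICES=$gpu echo "launch job $slot on GPU $gpu" &
  done
done
wait
```

Memory is still the binding constraint per GPU, of course: 3 jobs in 8 GB means each job has to stay under roughly 2.7 GB.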
Thanks!
Charles