AWS GPU problem

miken...@gmail.com

Nov 9, 2016, 10:16:37 AM
to kaldi-help
This is not strictly speaking a Kaldi question, though it might have affected other Kaldi users.

I am running training on an AWS cluster using g2.8xlarge instances. Several times in the last few days I have hit this error message:

>>>
LOG (nnet3-chain-train:SelectGpuId():cu-device.cc:146) num-gpus=3. Device 0: all CUDA-capable devices are busy or unavailable.  Device 1: all CUDA-capable devices are busy or
unavailable.  Device 2: all CUDA-capable devices are busy or unavailable. 
ERROR (nnet3-chain-train:SelectGpuId():cu-device.cc:147) Failed to create CUDA context, no more unused GPUs?
Failed to create CUDA context, no more unused GPUs?
<<<

If I run 'nvidia-smi', it shows that the machine has only 3 GPUs instead of the 4 it should have.

When we poke around, lspci tells us:
>>>
00:03.0 Unassigned class [ffff]: NVIDIA Corporation GK104GL [GRID K520] (rev ff)
00:04.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
00:05.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
00:06.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
<<<

We suspect this is faulty hardware, which we will report.
(I also suspect that if I kill this machine and bring up another, some fraction of the time we will get the same physical machine back, which is annoying.)
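
To double-check what the CUDA runtime itself can use (as opposed to what lspci reports), we can run a small probe along these lines -- just a minimal sketch, not Kaldi's actual SelectGpuId code, but it should fail with the same kind of message when a device is unusable:

>>>
// probe_gpus.cu -- try to create a CUDA context on every visible device.
// Build (assuming a standard CUDA toolkit install): nvcc probe_gpus.cu -o probe_gpus
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaError_t e = cudaGetDeviceCount(&n);
  if (e != cudaSuccess) {
    std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(e));
    return 1;
  }
  std::printf("num-gpus=%d\n", n);
  for (int i = 0; i < n; ++i) {
    e = cudaSetDevice(i);
    // cudaFree(0) forces lazy context creation on the selected device.
    if (e == cudaSuccess) e = cudaFree(0);
    std::printf("Device %d: %s\n", i,
                e == cudaSuccess ? "context created OK" : cudaGetErrorString(e));
    if (e == cudaSuccess) cudaDeviceReset();
    cudaGetLastError();  // clear any non-sticky error before probing the next device
  }
  return 0;
}
<<<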

Does anyone have any thoughts on this, or suggestions for something we should try?

Daniel Povey

Nov 9, 2016, 12:26:14 PM
to kaldi-help
I haven't seen this particular error before.
However, you should expect errors of various kinds on a regular basis if you are running a large cluster of GPUs. They're not as reliable as CPUs, and the drivers are not always fully debugged.
Dan


Danny Lloyd

Nov 11, 2016, 3:40:53 PM
to kaldi-help
I haven't had this occur, but I might be able to help diagnose it with a bit more info. Which distribution, driver, and CUDA version are you running?

And to clarify: you have tried stopping or terminating the instance and then launching again? And you've only gotten this failure intermittently? Have you tried launching in a different availability zone?

If your configuration is working for 3 of the 4 GPUs, that hints at a hardware issue, but it could still be drivers, as Dan said. Since the g2.8xlarge instances use 4 separate K520s, I suspect they're the same K520s that g2.2xlarge instances access individually. You might have one that previously had a different driver installed via a g2.2xlarge and was not cleaned up properly.
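
For the driver/CUDA question: nvidia-smi prints the driver version in its header, and a tiny program along these lines (just a sketch, assuming you build it against the same CUDA toolkit you use for Kaldi) reports what the process itself sees:

>>>
// cuda_versions.cu -- report the CUDA version supported by the installed
// driver vs. the CUDA runtime version this binary is linked against.
// Build: nvcc cuda_versions.cu -o cuda_versions
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int driver_ver = 0, runtime_ver = 0;
  cudaDriverGetVersion(&driver_ver);    // highest CUDA version the driver supports
  cudaRuntimeGetVersion(&runtime_ver);  // CUDA runtime version this binary uses
  // Versions are encoded as 1000*major + 10*minor, e.g. 7050 means CUDA 7.5.
  std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
              driver_ver / 1000, (driver_ver % 100) / 10,
              runtime_ver / 1000, (runtime_ver % 100) / 10);
  return 0;
}
<<<

A runtime newer than what the driver supports is worth ruling out, since that mismatch produces its own confusing errors.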