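When running zeroth, GPU detection fails and training stops at stage 12.
The problem does not seem to occur during GPU / CPU stress tests, so I am sharing the details in case there is something specific I should check when running zeroth.
All 8 GPUs are in use. After the first GPU fails to be detected, the others fail one at a time until none of them are detected.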
1. zeroth failure log
> vi nohup.out
steps/nnet3/chain/get_egs.sh: Finished preparing training examples
2022-03-02 14:04:00,697 [steps/nnet3/chain/train.py:431 - train - INFO ] Copying the properties from exp/chain_rvb/tdnn1n_rvb/egs to exp/chain_rvb/tdnn1n_rvb
2022-03-02 14:04:00,753 [steps/nnet3/chain/train.py:445 - train - INFO ] Computing the preconditioning matrix for input features
2022-03-02 14:04:07,877 [steps/nnet3/chain/train.py:454 - train - INFO ] Preparing the initial acoustic model.
2022-03-02 14:04:08,832 [steps/nnet3/chain/train.py:488 - train - INFO ] Training will run for 4.0 epochs = 14169 iterations
2022-03-02 14:04:08,835 [steps/nnet3/chain/train.py:535 - train - INFO ] Iter: 0/14168 Jobs: 2 Epoch: 0.00/4.0 (0.0% complete) lr: 0.003000
run.pl: job failed, log is in exp/chain_rvb/tdnn1n_rvb/log/train.0.1.log
> vi exp/chain_rvb/tdnn1n_rvb/log/train.0.2.log
ERROR (nnet3-chain-train[5.5.1009~2-e4940]:SelectGpuId():cu-device.cc:181) No CUDA GPU detected!, diagnostics: cudaError_t 100 : "no CUDA-capable device is detected", in cu-device.cc:181
[ Stack-Trace: ]
/home/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb0b) [0x7f56d4a19823]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x4108e7]
/home/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x402) [0x7f56d6545e58]
nnet3-chain-train(main+0x482) [0x40f6c9]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f5695008555]
nnet3-chain-train() [0x40f199]
kaldi::KaldiFatalError
# Accounting: time=8 threads=1
# Ended (code 255) at 2022. 03. 02. (수) 14:04:23 KST, elapsed time 8 seconds
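Since the training job only reports "No CUDA GPU detected" after a GPU has already disappeared, a check run before or between training stages may help narrow down when the first GPU is lost. The following is a minimal sketch, not part of zeroth or Kaldi. It assumes `nvidia-smi` is on the PATH and compares the bus IDs it reports against the eight bus IDs listed in the `nvidia-smi` output under "Additional information". The function names (`parse_gpu_list`, `missing_bus_ids`, `query_gpus`) are hypothetical. Bus IDs are compared rather than GPU indices because the indices may be renumbered once a GPU drops out.

```python
import subprocess

# Bus IDs taken from the nvidia-smi output in this post (assumption: they stay fixed).
EXPECTED_BUS_IDS = {
    "00000000:1B:00.0", "00000000:1C:00.0", "00000000:1D:00.0", "00000000:1E:00.0",
    "00000000:DB:00.0", "00000000:DC:00.0", "00000000:DD:00.0", "00000000:DE:00.0",
}


def parse_gpu_list(csv_text):
    """Parse 'index, pci.bus_id' CSV lines into a list of (index, bus_id) tuples.

    Lines that do not look like 'N, <bus id>' (for example error messages)
    are skipped.
    """
    gpus = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and parts[0].isdigit():
            gpus.append((int(parts[0]), parts[1].upper()))
    return gpus


def missing_bus_ids(gpus, expected=EXPECTED_BUS_IDS):
    """Return the sorted expected bus IDs that are absent from the parsed list."""
    seen = {bus_id for _, bus_id in gpus}
    return sorted(expected - seen)


def query_gpus():
    """Ask nvidia-smi for the currently visible GPUs (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_list(result.stdout)
```

Calling `missing_bus_ids(query_gpus())` from a wrapper script at intervals and logging the result with a timestamp would show which GPU disappears first and at what point in the run.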
2. /var/log/messages
Feb 28 10:43:23 localhost kernel: NVRM: GPU at PCI:0000:de:00: GPU-2ee13725-4b47-ba34-fcb0-6204c08d5be3
Feb 28 10:43:23 localhost kernel: NVRM: Xid (PCI:0000:de:00): 62, pid=20703, 0000(0000) 00000000 00000000
Feb 28 10:43:23 localhost kernel: sched: RT throttling activated
Feb 28 10:44:31 localhost kernel: NVRM: GPU 0000:de:00.0: RmInitAdapter failed! (0x23:0xffff:1401)
Feb 28 10:44:31 localhost kernel: NVRM: GPU 0000:de:00.0: rm_init_adapter failed, device minor number 7
...
Mar 2 14:22:18 localhost kernel: NVRM: GPU 0000:1b:00.0: rm_init_adapter failed, device minor number 0
Mar 2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: RmInitAdapter failed! (0x23:0xffff:1401)
Mar 2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: rm_init_adapter failed, device minor number 1
Mar 2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: RmInitAdapter failed! (0x23:0xffff:1401)
Mar 2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: rm_init_adapter failed, device minor number 1
Mar 2 14:22:18 localhost kernel: NVRM: GPU 0000:1d:00.0: RmInitAdapter failed! (0x23:0xffff:1401)
Mar 2 14:22:18 localhost kernel: NVRM: GPU 0000:1d:00.0: rm_init_adapter failed, device minor number 2
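To see the order in which the GPUs drop out, the NVRM lines can be summarized per PCI address. The following is a minimal sketch. The regular expressions are modeled only on the lines quoted above, so other driver versions may need different patterns, and the names `summarize`, `XID_RE`, and `INIT_FAIL_RE` are hypothetical.

```python
import re

# Patterns modeled on the NVRM lines quoted in this post (assumption).
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")
INIT_FAIL_RE = re.compile(r"NVRM: GPU ([0-9a-fA-F:.]+): RmInitAdapter failed!")


def normalize(bus):
    """Drop the PCI function suffix so '0000:de:00.0' and '0000:de:00' match."""
    return bus.split(".")[0].lower()


def summarize(lines):
    """Return (xids, init_failures) keyed by PCI address.

    xids maps an address to the list of Xid codes seen for it;
    init_failures maps an address to its RmInitAdapter failure count.
    Both dicts keep the order in which each address first appeared.
    """
    xids = {}
    init_failures = {}
    for line in lines:
        m = XID_RE.search(line)
        if m:
            xids.setdefault(normalize(m.group(1)), []).append(int(m.group(2)))
            continue
        m = INIT_FAIL_RE.search(line)
        if m:
            bus = normalize(m.group(1))
            init_failures[bus] = init_failures.get(bus, 0) + 1
    return xids, init_failures


# Sample input copied from the log excerpt above.
sample = [
    "Feb 28 10:43:23 localhost kernel: NVRM: Xid (PCI:0000:de:00): 62, pid=20703, 0000(0000) 00000000 00000000",
    "Feb 28 10:44:31 localhost kernel: NVRM: GPU 0000:de:00.0: RmInitAdapter failed! (0x23:0xffff:1401)",
    "Mar 2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: RmInitAdapter failed! (0x23:0xffff:1401)",
]
xids, init_failures = summarize(sample)
print(xids)
print(init_failures)
```

To run it over the full log, pass an open file instead of the sample list, for example `summarize(open("/var/log/messages"))`, which usually requires root privileges.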
3. GPU stress test => no GPU detection failure
2022. 03. 04. (금) 14:31:30 KST
utilization.gpu [%]
100 %
100 %
100 %
100 %
100 %
100 %
100 %
100 %
CPU / GPU temperature check
2022. 03. 04. (금) 15:26:07 KST
Package id 0: +39.0°C (high = +92.0°C, crit = +102.0°C)
Package id 1: +38.0°C (high = +92.0°C, crit = +102.0°C)
CPU Max Temp : 39°C
Check GPU Temp...
GPU Current Temp : 70 C
GPU Current Temp : 62 C
GPU Current Temp : 66 C
GPU Current Temp : 67 C
GPU Current Temp : 66 C
GPU Current Temp : 69 C
GPU Current Temp : 66 C
GPU Current Temp : 67 C
4. CPU stress test => no GPU detection failure
Used s-tui
yum install stress
Ran all CPU cores at 100% for 25 hours
2022. 03. 04. (금) 16:45:26 KST
Package id 0: +70.0°C (high = +92.0°C, crit = +102.0°C)
Package id 1: +67.0°C (high = +92.0°C, crit = +102.0°C)
CPU Max Temp : 70°C
Check GPU Temp...
GPU Current Temp : 36 C
GPU Current Temp : 33 C
GPU Current Temp : 33 C
GPU Current Temp : 34 C
GPU Current Temp : 33 C
GPU Current Temp : 34 C
GPU Current Temp : 33 C
GPU Current Temp : 33 C
Additional information
CPU: Intel(R) Xeon(R) Gold 6238R processor x 2
Memory: 32G x 12
GPU: RTX 3080 x 8
> nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 50%   29C    P0    83W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 38%   29C    P0    81W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 50%   29C    P0    78W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 52%   30C    P0    85W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:DB:00.0 Off |                  N/A |
| 50%   31C    P0    82W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:DC:00.0 Off |                  N/A |
| 60%   30C    P0    77W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:DD:00.0 Off |                  N/A |
| 50%   29C    P0    85W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:DE:00.0 Off |                  N/A |
| 30%   29C    P0    85W / 320W |      0MiB / 10240MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+