GPU detection failure while running zeroth


goorme

Mar 7, 2022, 2:17:47 AM
to zeroth-help
While running zeroth, GPU detection fails and training stops at stage 12.

The problem does not seem to reproduce under GPU/CPU stress tests, so I am sharing this in case there is anything specific I should check when running zeroth.

Of the 8 GPUs in use, after the first GPU fails to be detected, the remaining GPUs fail one by one until all of them are lost.

1. Log at the point zeroth stopped

> vi nohup.out
steps/nnet3/chain/get_egs.sh: Finished preparing training examples
2022-03-02 14:04:00,697 [steps/nnet3/chain/train.py:431 - train - INFO ] Copying the properties from exp/chain_rvb/tdnn1n_rvb/egs to exp/chain_rvb/tdnn1n_rvb
2022-03-02 14:04:00,753 [steps/nnet3/chain/train.py:445 - train - INFO ] Computing the preconditioning matrix for input features
2022-03-02 14:04:07,877 [steps/nnet3/chain/train.py:454 - train - INFO ] Preparing the initial acoustic model.
2022-03-02 14:04:08,832 [steps/nnet3/chain/train.py:488 - train - INFO ] Training will run for 4.0 epochs = 14169 iterations
2022-03-02 14:04:08,835 [steps/nnet3/chain/train.py:535 - train - INFO ] Iter: 0/14168   Jobs: 2   Epoch: 0.00/4.0 (0.0% complete)   lr: 0.003000
run.pl: job failed, log is in exp/chain_rvb/tdnn1n_rvb/log/train.0.1.log

> vi exp/chain_rvb/tdnn1n_rvb/log/train.0.2.log

 ERROR (nnet3-chain-train[5.5.1009~2-e4940]:SelectGpuId():cu-device.cc:181) No CUDA GPU detected!, diagnostics: cudaError_t 100 : "no CUDA-capable device is detected", in cu-device.cc:181

[ Stack-Trace: ]
/home/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb0b) [0x7f56d4a19823]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x4108e7]
/home/kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x402) [0x7f56d6545e58]
nnet3-chain-train(main+0x482) [0x40f6c9]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f5695008555]
nnet3-chain-train() [0x40f199]
kaldi::KaldiFatalError
# Accounting: time=8 threads=1
# Ended (code 255) at 2022. 03. 02. (수) 14:04:23 KST, elapsed time 8 seconds
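The decisive line in train.0.2.log is the `cudaError_t 100` reported by `SelectGpuId()`. As a quick triage step, that error string can be tallied across all failed train logs; the sketch below runs on a copy of the sample line from above (on the real server, pointing the grep at `exp/chain_rvb/tdnn1n_rvb/log/train.*.log` instead is my assumption based on the paths shown):

```shell
# Sketch: tally the CUDA error reported in failed Kaldi train logs.
# The sample line is copied from train.0.2.log above; on the server,
# grep over exp/chain_rvb/tdnn1n_rvb/log/train.*.log instead (assumed glob).
cat > /tmp/train_sample.log <<'EOF'
 ERROR (nnet3-chain-train[5.5.1009~2-e4940]:SelectGpuId():cu-device.cc:181) No CUDA GPU detected!, diagnostics: cudaError_t 100 : "no CUDA-capable device is detected", in cu-device.cc:181
EOF
grep -hoE 'cudaError_t [0-9]+ : "[^"]+"' /tmp/train_sample.log | sort | uniq -c
```

This prints a count next to each distinct CUDA error, which shows whether every job died with the same error 100 or whether different codes appear.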

2. /var/log/messages
Feb 28 10:43:23 localhost kernel: NVRM: GPU at PCI:0000:de:00: GPU-2ee13725-4b47-ba34-fcb0-6204c08d5be3
Feb 28 10:43:23 localhost kernel: NVRM: Xid (PCI:0000:de:00): 62, pid=20703, 0000(0000) 00000000 00000000
Feb 28 10:43:23 localhost kernel: sched: RT throttling activated
Feb 28 10:44:31 localhost kernel: NVRM: GPU 0000:de:00.0: RmInitAdapter failed! (0x23:0xffff:1401)
Feb 28 10:44:31 localhost kernel: NVRM: GPU 0000:de:00.0: rm_init_adapter failed, device minor number 7
...
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1b:00.0: rm_init_adapter failed, device minor number 0
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: RmInitAdapter failed! (0x23:0xffff:1401)
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: rm_init_adapter failed, device minor number 1
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: RmInitAdapter failed! (0x23:0xffff:1401)
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: rm_init_adapter failed, device minor number 1
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1d:00.0: RmInitAdapter failed! (0x23:0xffff:1401)
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1d:00.0: rm_init_adapter failed, device minor number 2
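The kernel log shows each device dying with an `Xid` event followed by `RmInitAdapter`/`rm_init_adapter` failures. A per-device tally makes the progression across the 8 GPUs easier to see; the sketch below works on lines copied from the excerpt above (on the server, run the grep pipeline over `/var/log/messages` itself):

```shell
# Sketch: count rm_init_adapter failures per PCI device.
# Sample lines are copied from the /var/log/messages excerpt above;
# on the server, grep /var/log/messages directly.
cat > /tmp/nvrm_sample.log <<'EOF'
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1b:00.0: rm_init_adapter failed, device minor number 0
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: rm_init_adapter failed, device minor number 1
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1c:00.0: rm_init_adapter failed, device minor number 1
Mar  2 14:22:18 localhost kernel: NVRM: GPU 0000:1d:00.0: rm_init_adapter failed, device minor number 2
EOF
grep -oE 'GPU [0-9a-f:.]+: rm_init_adapter failed' /tmp/nvrm_sample.log \
  | sort | uniq -c | sort -rn
```

The output lists each failing PCI address with its failure count, which makes it easy to confirm whether the failures spread to all 8 devices or stay on one bus.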

3. GPU stress test => no GPU detection failure
Installed gpu_burn and ran a GPU stress test => https://eungbean.github.io/2018/08/25/Gpu-stress-test/

2022. 03. 04. (금) 14:31:30 KST

utilization.gpu [%]
100 %
100 %
100 %
100 %
100 %
100 %
100 %
100 %

CPU / GPU temperature check

2022. 03. 04. (금) 15:26:07 KST
Package id 0:  +39.0°C  (high = +92.0°C, crit = +102.0°C)
Package id 1:  +38.0°C  (high = +92.0°C, crit = +102.0°C)
CPU Max Temp : 39°C
Check GPU Temp...
        GPU Current Temp                  : 70 C
        GPU Current Temp                  : 62 C
        GPU Current Temp                  : 66 C
        GPU Current Temp                  : 67 C
        GPU Current Temp                  : 66 C
        GPU Current Temp                  : 69 C
        GPU Current Temp                  : 66 C
        GPU Current Temp                  : 67 C
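For reference, the per-GPU readouts above can be condensed to a single peak figure; this awk sketch assumes the exact `GPU Current Temp ... : NN C` format shown (sample lines copied from the readout above; on the server, feed the live output of the temperature-check script instead):

```shell
# Sketch: report the hottest GPU from the temperature-check output.
# Sample lines are copied from the readout above; the format assumption
# is "GPU Current Temp ... : NN C" as printed there.
cat > /tmp/gputemp.log <<'EOF'
        GPU Current Temp                  : 70 C
        GPU Current Temp                  : 62 C
        GPU Current Temp                  : 69 C
EOF
awk '/GPU Current Temp/ { t = $(NF-1) + 0; if (t > max) max = t }
     END { print "max GPU temp: " max " C" }' /tmp/gputemp.log
```

For the sample above this prints `max GPU temp: 70 C`, well under the RTX 3080 throttle range, matching the conclusion that heat is not the trigger.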

4. CPU stress test => no GPU detection failure

Used s-tui

yum install stress

Ran all CPU cores at 100% for 25 hours
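The load described above presumably maps to an invocation along these lines; the flag values are assumptions reconstructed from the description (all cores, 100%, 25 hours), not the exact command that was run:

```shell
# Sketch of the CPU load described above. The core count and stress flags
# are assumptions; the post only says all cores ran at 100% for 25 hours.
cores=$( (command -v nproc >/dev/null 2>&1 && nproc) || echo 1 )
echo "stress --cpu ${cores} --timeout 25h"   # printed, not executed, for safety
```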

2022. 03. 04. (금) 16:45:26 KST
Package id 0:  +70.0°C  (high = +92.0°C, crit = +102.0°C)
Package id 1:  +67.0°C  (high = +92.0°C, crit = +102.0°C)
CPU Max Temp : 70°C
Check GPU Temp...
        GPU Current Temp                  : 36 C
        GPU Current Temp                  : 33 C
        GPU Current Temp                  : 33 C
        GPU Current Temp                  : 34 C
        GPU Current Temp                  : 33 C
        GPU Current Temp                  : 34 C
        GPU Current Temp                  : 33 C
        GPU Current Temp                  : 33 C


Additional information


CPU (Intel(R) Xeon(R) Gold 6238R processor * 2)

Memory (32G * 12)

GPU (RTX 3080 * 8)

> nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 50%   29C    P0    83W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 38%   29C    P0    81W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 50%   29C    P0    78W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 52%   30C    P0    85W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:DB:00.0 Off |                  N/A |
| 50%   31C    P0    82W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:DC:00.0 Off |                  N/A |
| 60%   30C    P0    77W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:DD:00.0 Off |                  N/A |
| 50%   29C    P0    85W / 320W |      0MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:DE:00.0 Off |                  N/A |
| 30%   29C    P0    85W / 320W |      0MiB / 10240MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
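Since nvidia-smi currently shows all 8 devices healthy, one more thing worth capturing is *when* they start disappearing during a zeroth run. A hypothetical once-a-minute visibility log (the interval and file path are my assumptions) would let the drop-off be correlated with the failing training iteration:

```shell
# Sketch: timestamped count of GPUs visible to the driver, for correlating
# with the zeroth iteration that fails. Interval and log path are assumptions.
poll_once() {
  printf '%s ' "$(date '+%F %T')"
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index --format=csv,noheader 2>/dev/null | wc -l
  else
    echo 0   # no driver tools visible from this shell
  fi
}
poll_once >> /tmp/gpu_visibility.log
# During training: while true; do poll_once >> /tmp/gpu_visibility.log; sleep 60; done
```

Each line is a timestamp plus the number of devices the driver reports, so the exact minute the count drops from 8 can be matched against the train.*.log timestamps.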