Training crashes with RTX 2080 Ti


Cemil Demir

Jul 11, 2019, 5:30:06 PM
to kaldi...@googlegroups.com
Hi,

I have 4 RTX 2080 Ti cards on the same workstation. I have run many gpu-burn experiments and there were no problems. Moreover, I trained some models using Wav2Letter without any problem.

However, when I try to train Kaldi chain models, the training crashes after some iterations. I have tried Ubuntu 18.04 and 19.04 with CUDA 9 and 10, but I could not solve this problem.

My question is: is there anybody using RTX 2080 Ti cards for Kaldi training?

If yes, which OS, Nvidia driver version, and CUDA version are being used?

Thank you.


Daniel Povey

Jul 11, 2019, 5:31:21 PM
to kaldi-help
Your description of the crash is too non-specific.


Cemil Demir

Jul 12, 2019, 2:56:04 AM
to kaldi...@googlegroups.com
Hi Daniel, after some iterations one of the GPUs becomes unusable (when I run nvidia-smi, it reports an error for the crashed GPU), and therefore the training stops.

There is nothing about the crash in the Kaldi logs. There are 4 training logs; I attached the log files for the last iteration. Since the 3rd GPU crashed, the 3rd log is not updated.

When I run nvidia-smi, it shows !ERR for the 3rd GPU.

My question is: is there anybody who uses RTX 2080 Ti cards in Kaldi training and gets similar crashes? If not, which configuration (OS, driver, and CUDA version) is used?

Thank you for your interest and help.  




train.16.2.log
train.16.4.log
train.16.3.log
train.16.1.log

Roshan S Sharma

Jul 12, 2019, 3:06:27 AM
to kaldi...@googlegroups.com
I think you need compute exclusive mode to fix this; I used to have similar issues.

Set it with nvidia-smi -c 3.
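
A minimal sketch of applying that (assuming four GPUs with indices 0-3; EXCLUSIVE_PROCESS is the named form of mode 3):

sudo nvidia-smi -c EXCLUSIVE_PROCESS        # set compute-exclusive mode on all GPUs
sudo nvidia-smi -i 2 -c EXCLUSIVE_PROCESS   # or target a single GPU by index
nvidia-smi -q -d COMPUTE                    # verify the compute mode took effect

The setting does not survive a reboot, so it usually has to be re-applied from a startup script.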

Cemil Demir

Jul 12, 2019, 9:37:18 AM
to kaldi...@googlegroups.com
I always use the GPUs in compute exclusive mode. It does not solve my problem.

Daniel Povey

Jul 12, 2019, 10:15:09 AM
to kaldi-help
If it's always the same GPU that crashes, it may be a hardware issue on that specific GPU; in that case you could disable it by setting e.g.
export CUDA_VISIBLE_DEVICES=0,1,3
in your path.sh
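
As a small sketch of that change (the indices here are only an example; keep whichever GPUs are healthy):

# in path.sh
export CUDA_VISIBLE_DEVICES=0,1,3   # hides the GPU with index 2 from CUDA programs launched by the scripts

nvidia-smi -L lists the installed GPUs with their indices and UUIDs; note that CUDA may enumerate devices in a different order than nvidia-smi unless CUDA_DEVICE_ORDER=PCI_BUS_ID is also exported.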


Charl van Heerden

Aug 4, 2019, 11:44:42 AM
to kaldi-help
Hi Cemil,

Have you resolved your issue with the 2080 Ti? I recently bought two new 2080 Tis, and I'm experiencing the same problem as you describe, i.e.:
- training chain models results in the entire system completely freezing after 10+ iterations of training, which requires a hard reset
- I have tried Ubuntu 16.04 as well as 18.04, Nvidia drivers from 418 to the most recent release (430.40), as well as various versions of CUDA (10.0 & 10.1). The only observation was that the crashes happened more frequently using the latest 430.40 Nvidia driver
- It's also important to note that I've used exactly the same setup (CUDA 10.1 & Nvidia 418) successfully with a Tesla K80, so I'm confident that this is something that is either 2080 Ti specific, or exposed by this card
- The observed symptoms are identical to what is described in this thread: https://github.com/davisking/dlib/issues/1513 (I have no idea if this problem is related to cudaStreamSynchronize though)
- I've run two hours of gpu-burn without any issues: are there any other GPU stress tests I can run to exclude a faulty card?
- One difference from your description is that this happens with both cards, or with either one of them: I have also used CUDA_VISIBLE_DEVICES to use only one card at a time, and both cards result in the system freezing
- I do not see any errors in the logs
- The system freezes after ~10 but before reaching 50 consecutive iterations of CUDA training

Any advice on how to debug this further would be much appreciated. I would also appreciate any feedback from the community on successful training with 2080 Tis, and if so, which CUDA + Nvidia driver combination was used?

Charl


Daniel Povey

Aug 4, 2019, 2:57:28 PM
to kaldi-help
In my experience with system administration at Hopkins, the system completely freezing up like that tends to result from hardware issues: either power supply deficiencies or motherboard problems.  It's just that the problems won't show up when the system is idle.
Whoever built your machine may deny it, though.

Dan



Daniel Povey

Aug 4, 2019, 2:57:28 PM
to kaldi-help
... you could try unplugging and reseating the GPU cards, though. That sometimes helps. But I doubt it will help here.

Charl van Heerden

Aug 5, 2019, 10:54:36 AM
to kaldi-help
Thank you, Dan. I'm in the process of organizing the desktop to be sent back to the supplier, who will test the power supply and motherboard for me. (As you also experienced, they assured me that the power supply is sufficient, and all hardware checks passed upon installation)

I ran the process today with cuda-memcheck, and got this output (I'm at iteration 273 because every time the system freezes, I continue from where it left off). Also using just 1 GPU at the moment.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2019-08-05 14:35:34,480 [steps/nnet3/chain/train.py:521 - train - INFO ] Iter: 273/389   Jobs: 1   Epoch: 7.00/10.0 (70.0% complete)   lr: 0.000100   
========= Program hit cudaErrorCudartUnloading (error 4) due to "driver shutting down" on CUDA API call to cudaFree. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x390513]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.10.1 (cudaFree + 0x186) [0x471e6]
=========     Host Frame:/home/cvheerden/local/src/kaldi/src/lib/libkaldi-cudamatrix.so (_ZN5kaldi17CuMemoryAllocatorD1Ev + 0x3c) [0xa2a5c]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__cxa_finalize + 0x9a) [0x3a36a]
=========     Host Frame:/home/cvheerden/local/src/kaldi/src/lib/libkaldi-cudamatrix.so [0x340c3]
=========
2019-08-05 16:32:02,879 [steps/nnet3/chain/train.py:521 - train - INFO ] Iter: 274/389   Jobs: 1   Epoch: 7.03/10.0 (70.3% complete)   lr: 0.000099   
2019-08-05 16:32:56,264 [steps/nnet3/chain/train.py:521 - train - INFO ] Iter: 275/389   Jobs: 1   Epoch: 7.05/10.0 (70.5% complete)   lr: 0.000099   
Timeout, server 192.168.115.41 not responding.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This is the first time the process broke after only 2 iterations. I'm also not sure if the CUDA error was printed at the correct point (i.e., from the logs, iteration 273 completed successfully, whereas iteration 274 broke, but the CUDA error is printed in between the two; I assume this may just be stderr flushing?).

Also, the last log that contains anything (274.1.log), ends with this:
LOG (nnet3-chain-train[5.5.437~1-5b26]:PrintStatsForThisPhase():nnet-training.cc:278) Average objective function for 'output-xent' for minibatches 140-149 is -1.39849 over 58624 frames.
LOG (nnet3-chain-train[5.5.437~1-5b26]:PrintStatsForThisPhase():nnet-training.cc:278) Average objective function for 'output' for minibatches 140-149 is -0.0941723 over 58624 frames.
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ (many more ^@'s; I stopped here)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

orum farhang

Aug 5, 2019, 11:19:00 AM
to kaldi-help
Did you check the GPU's temperature during training? 

Daniel Povey

Aug 5, 2019, 12:42:30 PM
to kaldi-help
You might want to make sure you are running the Nvidia persistence daemon (e.g. systemctl start nvidia-persistenced, or just sudo nvidia-persistenced). I very much doubt that's the issue though; I suspect power supply insufficiency.
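
A short sketch of turning the daemon on, assuming a systemd-based Ubuntu install where the nvidia-persistenced service is available:

sudo systemctl enable nvidia-persistenced   # start it at boot
sudo systemctl start nvidia-persistenced    # start it now
# one-off alternative without systemd:
sudo nvidia-persistenced
# the older per-GPU persistence mode via nvidia-smi also still works on most drivers:
sudo nvidia-smi -pm 1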

The binary zeros in the log are also normal if the machine crashed; it means it didn't have time to flush that data to disk.

We once had a machine with GPUs here that was randomly crashing every few weeks.  We had bought it together with another identical machine that was crashing every day or two.  The supplier replaced the day-or-two one but refused to replace the "every few weeks" one until I happened to mention it to NVidia and they put pressure on them.

Dan



Charl van Heerden

Aug 5, 2019, 1:09:29 PM
to kaldi-help
Thank you Dan. I just double-checked: I ran sudo nvidia-persistenced, restarted the job, and it again crashed within a few iterations. I will be sure to use your advice when negotiating with the supplier and make sure the power supply is the first component they test/replace.


@Orum: thank you for the suggestion. I checked the temperature every ~100ms, and it never went above 84C (the last couple of entries are below; a minimal sketch of such a polling loop follows the readings).
18:42:40.327959195 84C 53C 
18:42:40.465165645 83C 53C 
18:42:40.601967147 81C 53C 
18:42:40.739607601 80C 53C 
18:42:40.886587581 79C 53C 
18:42:41.024732406 79C 53C 
18:42:41.162284848 79C 53C 
18:42:41.336102444 79C 53C 
18:42:41.476210222 79C 53C 
18:42:41.617648479 79C 53C 
18:42:41.757778014 78C 53C 
18:42:41.897454406 78C 52C 
18:42:42.035441552 78C 53C 
18:42:42.174293148 78C 53C 
Timeout, server 192.168.115.41 not responding.
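
A minimal polling loop along these lines (a sketch, not necessarily the exact script used above; it assumes nvidia-smi's --query-gpu interface and GNU date):

while true; do
  printf '%s ' "$(date +%T.%N)"
  # one temperature reading per installed GPU, joined onto a single line
  nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | tr '\n' ' '
  echo
  sleep 0.1
done >> gpu_temps.log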


Truong Do

Aug 6, 2019, 12:21:35 AM
to kaldi-help
You might want to use CUDA 10.

Jonathan K

Aug 6, 2019, 3:00:43 AM
to kaldi-help
I have Ubuntu 18.04 with RTX 2080 Ti (x2) and it is working. I used this guide to install the Nvidia driver and the CUDA toolkit (+cuDNN!) and everything is working for me: https://medium.com/@avinchintha/how-to-install-nvidia-drivers-and-cuda-10-0-for-rtx-2080-ti-gpu-on-ubuntu-16-04-18-04-ce32e4edf1c0

Charl van Heerden

Aug 6, 2019, 10:18:12 AM
to kaldi-help
Thank you for the suggestions.

@Truong: I've tried CUDA 9.2, 10.0, 10.1 and the 10.1 update (which displays as 10.2 in nvidia-smi).

@Jonathan, thank you for the link. I just followed those steps to the letter, and unfortunately still got the same result. I think Dan's hypothesis is most likely correct, i.e., there is a power supply and/or hardware error.

Daniel Povey

Aug 6, 2019, 1:11:59 PM
to kaldi-help
FYI - I remember the guys who developed CNTK saying that at one point they had to modify their code to make it a little bit less efficient, to avoid crashing the GPUs their server farm had at the time.




joseph.an...@gmail.com

Aug 9, 2019, 7:27:18 AM
to kaldi-help
The only time I had NVIDIA GPUs crash while running Kaldi code was when the power supply wasn't sufficient. In one case it was an improperly plugged-in GPU power connector; I believe these cards use dual connectors. Check that both power connectors are plugged in and ensure sufficient power is delivered to them. What is the power rating of the power supply, and what does the rest of the components' power usage total up to?

Anand

Charl van Heerden

Aug 21, 2019, 11:01:46 AM
to kaldi-help
Thank you Anand, Dan, and everyone else for the kind and helpful suggestions. Herewith an update on the current status:
* I spent today at the supplier's. They tested every component, including the power supplies and GPUs, in Windows, and could not find any errors even when stress testing.
* I then demonstrated the system freezing I'm encountering when training a chain model. When I mentioned that the speech community suspects this is related to the power supply, they updated the BIOS, as they have had practical experience of power issues being resolved by BIOS updates. The chain model training ran smoothly for 6 hours and has just completed!

Thanks once again for all your advice. I'll post further updates if I encounter this issue again, or if I've been able to train some of our larger models successfully using the 2080Ti's.

Cemil Demir

Aug 21, 2019, 12:46:02 PM
to kaldi...@googlegroups.com
Hi, 
My training crashed after 18 hours. I also did a BIOS update, but it did not solve my problem.
I am trying to train models on 2000 hours of data with 4 RTX 2080 Ti cards. I also tried to limit power usage with the "nvidia-smi -pl" command: the original power limit was 250 W and I decreased it to 220 W, but I still did not succeed in training a model.
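
For reference, a sketch of how such a limit can be applied (220 W mirrors the value above; add -i <index> to target a single card):

sudo nvidia-smi -pm 1      # persistence mode, so the limit is not lost when the driver unloads
sudo nvidia-smi -pl 220    # set the power limit in watts, here for all GPUs
nvidia-smi -q -d POWER     # check the current and maximum power limits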
Now I am waiting for a new motherboard from my workstation vendor. If I can solve this problem with the motherboard change, I will update you.

Any help will be useful. Thank you. Regards.



Daniel Povey

Aug 21, 2019, 2:41:46 PM
to kaldi-help
From "The Tao of Programming":

  Hardware met Software on the road to Changtse. Software said:  "You are Yin and I am Yang. If we travel together, we will become famous and earn vast sums of money." And so they set forth together, thinking to conquer the world.

  Presently, they met Firmware, who was dressed in tattered rags and hobbled along propped on a thorny stick. Firmware said to  them: "The Tao lies beyond Yin and Yang. It is silent and still as a pool of water. It does not seek fame; therefore, nobody knows its presence. It does not seek fortune, for it is complete within itself. It exists beyond space and time."

  Software and Hardware, ashamed, returned to their homes.

Cemil Demir

Sep 6, 2019, 3:23:15 AM
to kaldi...@googlegroups.com
Hi,
I have good news about this issue. 

We could finish a training run with 4 RTX 2080 Ti GPU cards. It took about 89 hours with no crash.

What did we do to solve this problem? We changed the mainboard in our workstation. Previously we were using a Gigabyte mainboard and the training was crashing.
Now we have changed the mainboard and are using an ASUS mainboard. The current mainboard info is shown below.
Base Board Information
    Manufacturer: ASUSTeK COMPUTER INC.
    Product Name: WS C422 SAGE/10G
    Version: Rev 1.xx
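
That board information looks like dmidecode output; as a sketch, the board and BIOS details (the BIOS version mattered earlier in this thread) can be queried with:

sudo dmidecode -t baseboard   # manufacturer, product name, board revision
sudo dmidecode -t bios        # BIOS vendor, version, release date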

Thank you Daniel for your suggestions.
Regards.

