nnet2 training with GPU leaks memory (main memory)


Luan Pham

Mar 16, 2016, 10:27:08 AM
to kaldi-help
Dear all,

I have a problem where our server runs out of memory and crashes. After a bit of research, this looks like the same problem: https://github.com/matth/kaldi-cuda-leak-example

To summarize: my nnet2 training has many iterations, and I observed that after around 100 or so iterations, the main memory usage of the server increases. Our server has 132GB of memory, but it still crashes after around 2 days of training. The big problem is that even if I stop all the jobs halfway, the memory is not reclaimed by the system (only the main memory; the GPU memory is cleared). I think the problem is that when using the GPU, each iteration allocates a small amount of memory but doesn't free it after finishing. According to the documentation:

Rather than calling the malloc and free functions that NVidia provides, Kaldi does caching of previously released memory so that we don't have to incur the overhead of NVidia's malloc. ....  Anyway, the memory caching can cause a problem if for some reason you run using the default (non-exclusive) compute mode, because it can cause allocation failures. You can disable it at the code level by calling CuDevice::Instantiate().DisableCaching(), if needed.

I don't run in exclusive mode because the server is shared. But when I looked into the code to call the DisableCaching() function, the function is gone (according to this github commit).

So I just want a quick fix for this problem. Maybe after some amount of training I could free the memory held by the GPU and then continue. Would that be a good way to fix this problem?

Thank you,

Ruoho Ruotsi

Mar 16, 2016, 12:50:08 PM
to kaldi-help
What are your GPU specifications? Which corpus are you using and are you using Kaldi HEAD? 
I'm actively training with nnet2, using librispeech++.  My GPU only has 6GB and I haven't run into this issue, but now I'm curious.

Luan Pham

Mar 16, 2016, 12:52:22 PM
to kaldi-help
Here is the output of nvidia-smi. I'm training on the TEDLIUM corpus:

+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     Off |                  N/A |
| 16%   35C    P0    52W / 270W |     14MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980     Off  | 0000:83:00.0     Off |                  N/A |
|  0%   34C    P0    42W / 270W |     14MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Daniel Povey

Mar 16, 2016, 2:09:39 PM
to kaldi-help
Those warnings about allocation failures are just warnings; they are harmless.
It is impossible for a bug in *any* userspace code, Kaldi or otherwise, to be the cause of the problem you describe, because all memory is freed when a process exits, and Kaldi doesn't use long-running processes to train.
I suspect your machine is failing for reasons unrelated to exhaustion of memory.
You cannot trust the 'free' memory reported by 'top', because of the caching of files on disk.  You should add up free + buffers + cached to find out the truly free memory.
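For example, something like this on /proc/meminfo (values there are in kB) gives a rough figure for how much is really available:

awk '/^MemFree:|^Buffers:|^Cached:/ { kb += $2 } END { printf "%.1f GB effectively free\n", kb / 1024 / 1024 }' /proc/meminfo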
It's normal for the 'cached' memory to gradually increase, and it's harmless.  If your machine is crashing and there is nothing in the log, it's likely due to hardware problems with your CUDA cards, or an insufficient power supply.
Dan



Ruoho Ruotsi

Mar 16, 2016, 2:10:47 PM
to kaldi-help
I think Kaldi expects exclusive mode, so overloading/sharing the GPU with too many jobs can cause some jobs to crash. See Dan's comments here: https://sourceforge.net/p/kaldi/discussion/1355348/thread/13866b7f/
I'm running in exclusive mode with a card similar to yours:

+------------------------------------------------------+                       
| NVIDIA-SMI 352.79     Driver Version: 352.79         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:03:00.0      On |                  N/A |
| 60%   83C    P2   148W / 250W |   3369MiB /  6143MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1764    G   /usr/bin/X                                     772MiB |
|    0      3048    G   compiz                                         351MiB |
|    0      3799    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd   187MiB |
|    0      6819    G   ...vendor=NVIDIA --gpu-driver-version=352.79   239MiB |
|    0     13143    G   /usr/lib/firefox/firefox                         2MiB |
|    0     21862    C   nnet-train-simple                              295MiB |
|    0     21863    C   nnet-train-simple                              295MiB |
|    0     21865    C   nnet-train-simple                              295MiB |
|    0     21872    C   nnet-train-simple                              295MiB |
|    0     21881    C   nnet-train-simple                              295MiB |
|    0     21883    C   nnet-train-simple                              295MiB |
|    0     28697    G   /usr/bin/nvidia-settings                         6MiB |
+-----------------------------------------------------------------------------+

Daniel Povey

Mar 16, 2016, 2:15:23 PM
to kaldi-help
Just noticed that part: there is no need to disable the caching any more, as it never caches very much memory any more.  Anyway, it's not the cause of your problems.
Dan



Luan Pham

Mar 17, 2016, 6:19:03 AM
to kaldi-help, dpo...@gmail.com
@Ruoho: I changed the --num-jobs-nnet variable to 4. The lower this variable, the larger the number of iterations: in my case, changing num-jobs-nnet from 16 to 4 increased the number of iterations from 200 to around 1900. The problem only shows up clearly when you have a lot of iterations, and most of the time the training finishes before the server runs out of memory. But even with a small number of iterations, say 150-300, you can still see the leak if you compare your memory usage before and after training. I trained my model, stopped it when it hit around 140 iterations, and noticed an increase of about 25GB in RAM usage. And after I stopped the processes, the high RAM usage remained.

@Dan: Thanks for your response. I will try to train on a different server and attach some screenshots next week, because the server is currently being used for a different project. I still think the problem is a memory leak. I used htop to monitor memory usage, and it distinguishes three types of memory: used, cached, and buffers. The used memory behaves normally for other processes: it increases when they launch and decreases when they exit. But the memory used up by the training process is not reclaimed even after the training process is killed or terminated. So when I train a new model or continue my training, the leaked memory keeps stacking up. As mentioned in previous posts, this happens when I have a lot of iterations (around 1900 in one training run, because I decreased --num-jobs-nnet). Most of the time, with the default script, there are only around 200 iterations, so training finishes before the server runs out of memory. But even in that case, I can still see the used memory on the server increase.
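Next week I will also log the numbers while the training runs, with something as simple as the loop below, so I can attach exact figures and not just screenshots:

# append overall memory usage to a log file every 10 minutes
while true; do { date; free -m; echo; } >> mem_log.txt; sleep 600; done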

Daniel Povey

Mar 17, 2016, 7:42:34 PM
to Luan Pham, kaldi-help
The output of the two commands may help track down where the memory is going.

cat /proc/slabinfo | awk '{x=$4*$14; print x, $0; }' | sort -nr 

ps -Al | awk '$10 > 1000000' 
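# The first command prints a rough size estimate for each slab cache (object size times number of
# active slabs, columns 4 and 14 of /proc/slabinfo), sorted largest first.  The second lists any
# process whose SZ field (in pages, typically 4 kB each) is over a million pages, i.e. roughly 4 GB.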

Luan Pham

Mar 18, 2016, 1:36:10 PM
to dpo...@gmail.com, kaldi-help
Dear Daniel, 

Sorry for the late reply. I don't have root, so I had to wait for the person with root access.

The output of slabinfo: 

380955576 nfs_inode_cache   1107408 1107429   1032    3    1 : tunables   24   12    8 : slabdata 369143 369143      0
317128704 kmalloc-16384      19356  19356  16384    1    4 : tunables    8    4    0 : slabdata  19356  19356      0
85393408 kmalloc-65536       1303   1303  65536    1   16 : tunables    8    4    0 : slabdata   1303   1303      0
35709696 radix_tree_node   433828 433972    576    7    1 : tunables   54   27    8 : slabdata  61996  61996      0
35504128 kmalloc-8192        4334   4334   8192    1    2 : tunables    8    4    0 : slabdata   4334   4334      0
32837632 kmalloc-4096        8017   8017   4096    1    1 : tunables   24   12    8 : slabdata   8017   8017      0
15500000 ext4_inode_cache   61906  62000   1000    4    1 : tunables   54   27    8 : slabdata  15500  15500      0
11418240 dentry            1246692 1248870    192   21    1 : tunables  120   60    8 : slabdata  59470  59470      0
5943296 xfs_inode          23092  23216   1024    4    1 : tunables   54   27    8 : slabdata   5804   5804      0
4314624 kmalloc-512        67251  67416    512    8    1 : tunables   54   27    8 : slabdata   8427   8427      0
4128768 kmalloc-2048        3909   4032   2048    2    1 : tunables   24   12    8 : slabdata   2016   2016      0
3485352 buffer_head       1301018 1307007    104   39    1 : tunables  120   60    8 : slabdata  33513  33513      0
1992704 kmalloc-1024        7381   7784   1024    4    1 : tunables   54   27    8 : slabdata   1946   1946      0
1809088 nvidia_pte_cache  7009760 7010216     32  124    1 : tunables  120   60    8 : slabdata  56534  56534      0
1706752 kmalloc-256       105614 106672    256   16    1 : tunables  120   60    8 : slabdata   6667   6667      0
1433088 inode_cache        17359  17416    576    7    1 : tunables   54   27    8 : slabdata   2488   2488      0
685392 idr_layer_cache      939    981   2096    3    2 : tunables   24   12    8 : slabdata    327    327      0
663488 kmalloc-64        652311 653121     64   63    1 : tunables  120   60    8 : slabdata  10367  10367      0
536320 proc_inode_cache    4121   5028    640    6    1 : tunables   54   27    8 : slabdata    838    838      0
505920 kmalloc-192        54175  55335    192   21    1 : tunables  120   60    8 : slabdata   2635   2635      0
498256 task_struct          507    627   2384    3    2 : tunables   24   12    8 : slabdata    209    209      0
416064 sighand_cache        497    591   2112    3    2 : tunables   24   12    8 : slabdata    197    197      0
403712 kmalloc-96         96723  97774    128   31    1 : tunables  120   60    8 : slabdata   3154   3154      0
262144 kmalloc-262144         1      1 262144    1   64 : tunables    1    1    0 : slabdata      1      1      0
262144 kmalloc-131072         2      2 131072    1   32 : tunables    8    4    0 : slabdata      2      2      0
260672 kmalloc-32        1009578 1010104     32  124    1 : tunables  120   60    8 : slabdata   8146   8146      0
221184 biovec-256            54     54   4096    1    1 : tunables   24   12    8 : slabdata     54     54      0
163840 kmalloc-32768          5      5  32768    1    8 : tunables    8    4    0 : slabdata      5      5      0
160688 shmem_inode_cache   1248   1452    664    6    1 : tunables   54   27    8 : slabdata    242    242      0
144720 Acpi-Operand      111808 112560     72   56    1 : tunables  120   60    8 : slabdata   2010   2010      0
139800 kernfs_node_cache  38390  38445    120   33    1 : tunables  120   60    8 : slabdata   1165   1165      0
136952 xfs_ili            23090  23426    152   26    1 : tunables  120   60    8 : slabdata    901    901      0
112896 signal_cache         497    686   1152    7    2 : tunables   24   12    8 : slabdata     98     98      0
106176 kmem_cache           233    237   1344    3    1 : tunables   24   12    8 : slabdata     79     79      0
83200 task_xstate          511    900    832    9    2 : tunables   54   27    8 : slabdata    100    100      0
77824 names_cache           19     19   4096    1    1 : tunables   24   12    8 : slabdata     19     19      0
74240 sock_inode_cache     581    696    640    6    1 : tunables   54   27    8 : slabdata    116    116      0
64512 RAW                  288    288    896    4    1 : tunables   54   27    8 : slabdata     72     72      0
62976 vm_area_struct      4599   6888    192   21    1 : tunables  120   60    8 : slabdata    328    328     80
62208 filp                1344   3888    256   16    1 : tunables  120   60    8 : slabdata    243    243      0
44544 xfs_buf             1122   1160    384   10    1 : tunables   54   27    8 : slabdata    116    116      0
40576 kmalloc-128         9165   9827    128   31    1 : tunables  120   60    8 : slabdata    317    317      0
30720 files_cache          199    288    640    6    1 : tunables   54   27    8 : slabdata     48     48      0
28800 UNIX                  86    120    960    4    1 : tunables   54   27    8 : slabdata     30     30      0
26880 mm_struct            110    112    960    4    1 : tunables   54   27    8 : slabdata     28     28      0
23936 RAWv6                147    154   1088    7    2 : tunables   24   12    8 : slabdata     22     22      0
16704 cred_jar             691   1827    192   21    1 : tunables  120   60    8 : slabdata     87     87      0
13888 blkdev_queue           8     14   1984    2    1 : tunables   24   12    8 : slabdata      7      7      0
12544 TCP                   13     14   1792    2    1 : tunables   24   12    8 : slabdata      7      7      0
12288 nvidia_stack_cache      1      1  12288    1    4 : tunables    8    4    0 : slabdata      1      1      0
11904 TCPv6                  8     12   1984    2    1 : tunables   24   12    8 : slabdata      6      6      0
10240 rpc_buffers           10     10   2048    2    1 : tunables   24   12    8 : slabdata      5      5      0
8512 anon_vma_chain      3505   8379     64   63    1 : tunables  120   60    8 : slabdata    133    133     80
8192 sgpool-128             2      2   4096    1    1 : tunables   24   12    8 : slabdata      2      2      0
8080 anon_vma            2282   5050     80   50    1 : tunables  120   60    8 : slabdata    101    101     32
7488 bdev_cache            13     36    832    4    1 : tunables   54   27    8 : slabdata      9      9      0
7168 nfs_write_data        32     32    896    4    1 : tunables   54   27    8 : slabdata      8      8      0
7080 ext4_extent_status  17065  17523     40   99    1 : tunables  120   60    8 : slabdata    177    177      0
7056 ext4_groupinfo_4k   1330   1372    144   28    1 : tunables  120   60    8 : slabdata     49     49      0
6272 UDP                   20     28    896    4    1 : tunables   54   27    8 : slabdata      7      7      0
5440 UDPv6                  9     35   1088    7    2 : tunables   24   12    8 : slabdata      5      5      0
5376 task_delay_info      512   1728    112   36    1 : tunables  120   60    8 : slabdata     48     48      0
5376 pid                  375   1302    128   31    1 : tunables  120   60    8 : slabdata     42     42      0
4640 cfq_queue            126    340    232   17    1 : tunables  120   60    8 : slabdata     20     20      0
4480 rpc_inode_cache       25     42    640    6    1 : tunables   54   27    8 : slabdata      7      7      0
4000 Acpi-Namespace      9833   9900     40   99    1 : tunables  120   60    8 : slabdata    100    100      0


I cut the output short, dropping the caches with small memory usage. The other command gives no output, because the server is not running any jobs at the moment. Yet the server still uses 30GB of memory. Looks like kmalloc is the culprit?
 

PS: I attach the full output. 

--
Luan Pham

(attachment: slabinfo)

Daniel Povey

Mar 18, 2016, 2:22:30 PM
to Luan Pham, kaldi-help
kmalloc-* are likely just pools of memory that the kernel allocates from.  Everything there looks normal to me; I don't see a problem.
Dan

Daniel Povey

Mar 18, 2016, 3:05:23 PM
to Luan Pham, kaldi-help
Actually, I think I was wrong.
kmalloc can only be called by kernel or driver code.  That memory is actually being used by the kernel or by drivers (active_slabs == num_slabs), and it's a huge amount of memory.
That points to a kernel or driver bug, which is what I originally thought.
There's not much you can do apart from reinstalling the NVidia drivers with a newer (or at least different) version; that's the most likely cause.
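If you want to confirm that before reinstalling, you could log the largest slab caches while a training job is running and check whether the same ones keep growing; something along these lines (reading /proc/slabinfo may need root, as you found):

while true; do
  date >> slab_log.txt
  awk '{ print $4 * $14, $1 }' /proc/slabinfo | sort -nr | head -5 >> slab_log.txt
  sleep 600
done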
Dan

Ruoho Ruotsi

Mar 18, 2016, 3:06:45 PM
to kaldi-help, thanhl...@gmail.com, dpo...@gmail.com
Hi Luan,
here are the debugging ideas that come to mind: 

  • Rule out your NVIDIA driver being a factor. nvidia-smi reports your driver is 352.39 (i.e. from last summer, so pretty old). You can look at newer drivers or the latest release (http://www.nvidia.com/Download/index.aspx?lang=en-us); I've found my version, 352.79, to be good enough with CUDA 7.5.
  • For memory-usage profiling that you can actually make sense of, even for a complex project like Kaldi, I'd recommend a tool like the NVIDIA Visual Profiler: https://developer.nvidia.com/nvidia-visual-profiler
  • Rule out Kaldi. This means trying to reproduce the issue (via large memory reads/writes, etc.) from a simple CUDA program. For example, modifying one of the NVIDIA_CUDA-7.5_Samples will be easier to understand/debug, and you will quickly learn your card's performance limits. If you can repro the issue in a small stand-alone program, you can also test out any fixes much more quickly.
  • Rule out exclusive vs. shared mode. Do you know who else is submitting jobs? Is the server accepting asynchronous requests? Can you guarantee you're the only one working on it? (A couple of commands for checking and changing this are after this list.)
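For the driver and exclusive-mode points, these are the kinds of commands I'd start with (the mode change needs root; adjust the -i index to pick a GPU):

# show the installed driver version and the compute mode of each GPU
nvidia-smi -q | grep -E 'Driver Version|Compute Mode'

# put GPU 0 into exclusive-process mode
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# see which processes currently have the NVIDIA devices open
fuser -v /dev/nvidia*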

Daniel Povey

Mar 18, 2016, 3:07:58 PM
to Ruoho Ruotsi, kaldi-help, Luan Pham
All userspace programs (which includes Kaldi) are already ruled out by the diagnostics he ran.  That memory is owned in kernel space -> must be driver (likely) or kernel (unlikely).
Dan

Luan Pham

Mar 18, 2016, 3:11:13 PM
to Daniel Povey, Ruoho Ruotsi, kaldi-help
I agree with both of you that it's likely the driver is the problem. I will install a new driver next Monday and let you know. 

Thank you for your help,
--
Luan Pham
