nnet2 training with GPU leaks memory (main memory)


Luan Pham

Mar 16, 2016, 10:27:08 AM
to kaldi-help
Dear all,

I have a problem where our server runs out of memory and crashes. After a bit of research, this looks like the same problem: https://github.com/matth/kaldi-cuda-leak-example

To summarize: my nnet2 training has many iterations, and I observed that after around 100 or so iterations, the main memory usage of the server increases. Our server has 132GB of memory, but it still crashes after around 2 days of training. The big problem is that even if I stop all the jobs halfway, the memory is not reclaimed by the system (only the main memory; the GPU memory is cleared). I think the problem is that when using the GPU, each iteration allocates a small amount of memory but doesn't free it after finishing. According to the documentation:

Rather than calling the malloc and free functions that NVidia provides, Kaldi does caching of previously released memory so that we don't have to incur the overhead of NVidia's malloc. ....  Anyway, the memory caching can cause a problem if for some reason you run using the default (non-exclusive) compute mode, because it can cause allocation failures. You can disable it at the code level by calling CuDevice::Instantiate().DisableCaching(), if needed.

I don't run in exclusive mode because the server is shared. But when I looked into the code to call the DisableCaching() function, the function is gone (according to this github commit).

So I just want a quick fix for this problem. Maybe after some amount of training I could free the memory held by the GPU and then continue. Would that be a good way to fix this problem?

Thank you,

Ruoho Ruotsi

Mar 16, 2016, 12:50:08 PM
to kaldi-help
What are your GPU specifications? Which corpus are you using and are you using Kaldi HEAD? 
I'm actively training with nnet2, using librispeech++.  My GPU only has 6GB and I haven't run into this issue, but now I'm curious.

Luan Pham

Mar 16, 2016, 12:52:22 PM
to kaldi-help
Here is the output of nvidia-smi. I'm training on the TEDLIUM corpus:

+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     Off |                  N/A |
| 16%   35C    P0    52W / 270W |     14MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980     Off  | 0000:83:00.0     Off |                  N/A |
|  0%   34C    P0    42W / 270W |     14MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Daniel Povey

Mar 16, 2016, 2:09:39 PM
to kaldi-help
Those warnings about allocation failures are just warnings; they are harmless.
It is impossible for a bug in *any* userspace code, Kaldi or otherwise, to be the cause of the problem you describe, because all memory is freed when a process exits, and Kaldi doesn't use long-running processes to train.
I suspect your machine is failing for reasons unrelated to exhaustion of memory.
You cannot trust the 'free' memory reported by 'top', because of the caching of files on disk.  You should add up free + buffers + cached to find out the truly free memory.
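For example, something like this on /proc/meminfo (values there are in kB) gives a rough figure for how much is really available:

awk '/^MemFree:|^Buffers:|^Cached:/ { kb += $2 } END { printf "%.1f GB effectively free\n", kb / 1024 / 1024 }' /proc/meminfo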
It's normal for the 'cached' memory to gradually increase, and it's harmless.  If your machine is crashing and there is nothing in the log, it's likely due to hardware problems with your CUDA cards, or an insufficient power supply.
Dan



Ruoho Ruotsi

Mar 16, 2016, 2:10:47 PM
to kaldi-help
I think Kaldi expects exclusive mode, so overloading/sharing the GPU with too many jobs can cause some jobs to crash. See Dan's comments here: https://sourceforge.net/p/kaldi/discussion/1355348/thread/13866b7f/
I'm running in exclusive mode with a card similar to yours:

+------------------------------------------------------+                       
| NVIDIA-SMI 352.79     Driver Version: 352.79         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:03:00.0      On |                  N/A |
| 60%   83C    P2   148W / 250W |   3369MiB /  6143MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1764    G   /usr/bin/X                                     772MiB |
|    0      3048    G   compiz                                         351MiB |
|    0      3799    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd   187MiB |
|    0      6819    G   ...vendor=NVIDIA --gpu-driver-version=352.79   239MiB |
|    0     13143    G   /usr/lib/firefox/firefox                         2MiB |
|    0     21862    C   nnet-train-simple                              295MiB |
|    0     21863    C   nnet-train-simple                              295MiB |
|    0     21865    C   nnet-train-simple                              295MiB |
|    0     21872    C   nnet-train-simple                              295MiB |
|    0     21881    C   nnet-train-simple                              295MiB |
|    0     21883    C   nnet-train-simple                              295MiB |
|    0     28697    G   /usr/bin/nvidia-settings                         6MiB |
+-----------------------------------------------------------------------------+

Daniel Povey

Mar 16, 2016, 2:15:23 PM
to kaldi-help
Just noticed that part: there is no need to disable the caching any more, as it never caches very much memory any more.  Anyway, it's not the cause of your problems.
Dan



Luan Pham

Mar 17, 2016, 6:19:03 AM
to kaldi-help, dpo...@gmail.com
@Ruoho: I changed the --num-jobs-nnet variable to 4. The lower this variable, the larger the number of iterations: in my case, changing num-jobs-nnet from 16 to 4 increased the number of iterations from 200 to around 1900. The problem only shows up clearly when you have a lot of iterations, and most of the time the training finishes before the server runs out of memory. But even with a small number of iterations, say 150-300, you can still see the leak if you compare your memory usage before and after training. I trained my model, stopped it when it hit around 140 iterations, and noticed an increase of about 25GB in RAM usage. And after I stopped the processes, the high RAM usage remained.

@Dan: Thanks for your response. I will try to train on a different server and attach some screenshots next week, because the server is currently being used for a different project. I still think the problem is a memory leak. I used htop to monitor memory usage, and it distinguishes three types of memory: used, cached, and buffers. The used memory behaves normally for other processes: it increases when they launch and decreases when they exit. But the memory used up by the training process is not reclaimed even after the training process is killed or terminated. So when I train a new model or continue my training, the leaked memory keeps stacking up. As mentioned in previous posts, this happens when I have a lot of iterations (around 1900 in one training run, because I decreased --num-jobs-nnet). Most of the time, with the default script, there are only around 200 iterations, so training finishes before the server runs out of memory. But even in that case, I can still see the used memory on the server increase.
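Next week I will also log the numbers while the training runs, with something as simple as the loop below, so I can attach exact figures and not just screenshots:

# append overall memory usage to a log file every 10 minutes
while true; do { date; free -m; echo; } >> mem_log.txt; sleep 600; done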

Daniel Povey

Mar 17, 2016, 7:42:34 PM
to Luan Pham, kaldi-help
The output of the two commands may help track down where the memory is going.

cat /proc/slabinfo | awk '{x=$4*$14; print x, $0; }' | sort -nr 

ps -Al | awk '$10 > 1000000' 
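# The first command prints a rough size estimate for each slab cache (object size times number of
# active slabs, columns 4 and 14 of /proc/slabinfo), sorted largest first.  The second lists any
# process whose SZ field (in pages, typically 4 kB each) is over a million pages, i.e. roughly 4 GB.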

Luan Pham

Mar 18, 2016, 1:36:10 PM
to dpo...@gmail.com, kaldi-help
Dear Daniel, 

Sorry for the late reply. I don't have root, so I had to wait for the person with root access.

The output of slabinfo: 

380955576 nfs_inode_cache   1107408 1107429   1032    3    1 : tunables   24   12    8 : slabdata 369143 369143      0
317128704 kmalloc-16384      19356  19356  16384    1    4 : tunables    8    4    0 : slabdata  19356  19356      0
85393408 kmalloc-65536       1303   1303  65536    1   16 : tunables    8    4    0 : slabdata   1303   1303      0
35709696 radix_tree_node   433828 433972    576    7    1 : tunables   54   27    8 : slabdata  61996  61996      0
35504128 kmalloc-8192        4334   4334   8192    1    2 : tunables    8    4    0 : slabdata   4334   4334      0
32837632 kmalloc-4096        8017   8017   4096    1    1 : tunables   24   12    8 : slabdata   8017   8017      0
15500000 ext4_inode_cache   61906  62000   1000    4    1 : tunables   54   27    8 : slabdata  15500  15500      0
11418240 dentry            1246692 1248870    192   21    1 : tunables  120   60    8 : slabdata  59470  59470      0
5943296 xfs_inode          23092  23216   1024    4    1 : tunables   54   27    8 : slabdata   5804   5804      0
4314624 kmalloc-512        67251  67416    512    8    1 : tunables   54   27    8 : slabdata   8427   8427      0
4128768 kmalloc-2048        3909   4032   2048    2    1 : tunables   24   12    8 : slabdata   2016   2016      0
3485352 buffer_head       1301018 1307007    104   39    1 : tunables  120   60    8 : slabdata  33513  33513      0
1992704 kmalloc-1024        7381   7784   1024    4    1 : tunables   54   27    8 : slabdata   1946   1946      0
1809088 nvidia_pte_cache  7009760 7010216     32  124    1 : tunables  120   60    8 : slabdata  56534  56534      0
1706752 kmalloc-256       105614 106672    256   16    1 : tunables  120   60    8 : slabdata   6667   6667      0
1433088 inode_cache        17359  17416    576    7    1 : tunables   54   27    8 : slabdata   2488   2488      0
685392 idr_layer_cache      939    981   2096    3    2 : tunables   24   12    8 : slabdata    327    327      0
663488 kmalloc-64        652311 653121     64   63    1 : tunables  120   60    8 : slabdata  10367  10367      0
536320 proc_inode_cache    4121   5028    640    6    1 : tunables   54   27    8 : slabdata    838    838      0
505920 kmalloc-192        54175  55335    192   21    1 : tunables  120   60    8 : slabdata   2635   2635      0
498256 task_struct          507    627   2384    3    2 : tunables   24   12    8 : slabdata    209    209      0
416064 sighand_cache        497    591   2112    3    2 : tunables   24   12    8 : slabdata    197    197      0
403712 kmalloc-96         96723  97774    128   31    1 : tunables  120   60    8 : slabdata   3154   3154      0
262144 kmalloc-262144         1      1 262144    1   64 : tunables    1    1    0 : slabdata      1      1      0
262144 kmalloc-131072         2      2 131072    1   32 : tunables    8    4    0 : slabdata      2      2      0
260672 kmalloc-32        1009578 1010104     32  124    1 : tunables  120   60    8 : slabdata   8146   8146      0
221184 biovec-256            54     54   4096    1    1 : tunables   24   12    8 : slabdata     54     54      0
163840 kmalloc-32768          5      5  32768    1    8 : tunables    8    4    0 : slabdata      5      5      0
160688 shmem_inode_cache   1248   1452    664    6    1 : tunables   54   27    8 : slabdata    242    242      0
144720 Acpi-Operand      111808 112560     72   56    1 : tunables  120   60    8 : slabdata   2010   2010      0
139800 kernfs_node_cache  38390  38445    120   33    1 : tunables  120   60    8 : slabdata   1165   1165      0
136952 xfs_ili            23090  23426    152   26    1 : tunables  120   60    8 : slabdata    901    901      0
112896 signal_cache         497    686   1152    7    2 : tunables   24   12    8 : slabdata     98     98      0
106176 kmem_cache           233    237   1344    3    1 : tunables   24   12    8 : slabdata     79     79      0
83200 task_xstate          511    900    832    9    2 : tunables   54   27    8 : slabdata    100    100      0
77824 names_cache           19     19   4096    1    1 : tunables   24   12    8 : slabdata     19     19      0
74240 sock_inode_cache     581    696    640    6    1 : tunables   54   27    8 : slabdata    116    116      0
64512 RAW                  288    288    896    4    1 : tunables   54   27    8 : slabdata     72     72      0
62976 vm_area_struct      4599   6888    192   21    1 : tunables  120   60    8 : slabdata    328    328     80
62208 filp                1344   3888    256   16    1 : tunables  120   60    8 : slabdata    243    243      0
44544 xfs_buf             1122   1160    384   10    1 : tunables   54   27    8 : slabdata    116    116      0
40576 kmalloc-128         9165   9827    128   31    1 : tunables  120   60    8 : slabdata    317    317      0
30720 files_cache          199    288    640    6    1 : tunables   54   27    8 : slabdata     48     48      0
28800 UNIX                  86    120    960    4    1 : tunables   54   27    8 : slabdata     30     30      0
26880 mm_struct            110    112    960    4    1 : tunables   54   27    8 : slabdata     28     28      0
23936 RAWv6                147    154   1088    7    2 : tunables   24   12    8 : slabdata     22     22      0
16704 cred_jar             691   1827    192   21    1 : tunables  120   60    8 : slabdata     87     87      0
13888 blkdev_queue           8     14   1984    2    1 : tunables   24   12    8 : slabdata      7      7      0
12544 TCP                   13     14   1792    2    1 : tunables   24   12    8 : slabdata      7      7      0
12288 nvidia_stack_cache      1      1  12288    1    4 : tunables    8    4    0 : slabdata      1      1      0
11904 TCPv6                  8     12   1984    2    1 : tunables   24   12    8 : slabdata      6      6      0
10240 rpc_buffers           10     10   2048    2    1 : tunables   24   12    8 : slabdata      5      5      0
8512 anon_vma_chain      3505   8379     64   63    1 : tunables  120   60    8 : slabdata    133    133     80
8192 sgpool-128             2      2   4096    1    1 : tunables   24   12    8 : slabdata      2      2      0
8080 anon_vma            2282   5050     80   50    1 : tunables  120   60    8 : slabdata    101    101     32
7488 bdev_cache            13     36    832    4    1 : tunables   54   27    8 : slabdata      9      9      0
7168 nfs_write_data        32     32    896    4    1 : tunables   54   27    8 : slabdata      8      8      0
7080 ext4_extent_status  17065  17523     40   99    1 : tunables  120   60    8 : slabdata    177    177      0
7056 ext4_groupinfo_4k   1330   1372    144   28    1 : tunables  120   60    8 : slabdata     49     49      0
6272 UDP                   20     28    896    4    1 : tunables   54   27    8 : slabdata      7      7      0
5440 UDPv6                  9     35   1088    7    2 : tunables   24   12    8 : slabdata      5      5      0
5376 task_delay_info      512   1728    112   36    1 : tunables  120   60    8 : slabdata     48     48      0
5376 pid                  375   1302    128   31    1 : tunables  120   60    8 : slabdata     42     42      0
4640 cfq_queue            126    340    232   17    1 : tunables  120   60    8 : slabdata     20     20      0
4480 rpc_inode_cache       25     42    640    6    1 : tunables   54   27    8 : slabdata      7      7      0
4000 Acpi-Namespace      9833   9900     40   99    1 : tunables  120   60    8 : slabdata    100    100      0


I cut the output short, dropping the caches with small memory usage. The other command gives no output, because the server is not running any jobs at the moment. Yet the server still uses 30GB of memory. Looks like kmalloc is the culprit?
 

PS: I attach the full output. 

--
Luan Pham

(attachment: slabinfo)

Daniel Povey

Mar 18, 2016, 2:22:30 PM
to Luan Pham, kaldi-help
kmalloc-* are likely just pools of memory that the kernel allocates from.  Everything there looks normal to me; I don't see a problem.
Dan

Daniel Povey

Mar 18, 2016, 3:05:23 PM
to Luan Pham, kaldi-help
Actually, I think I was wrong.
kmalloc can only be called by kernel or driver code.  That memory is actually being used by the kernel or by drivers (active_slabs == num_slabs), and it's a huge amount of memory.
That points to a kernel or driver bug, which is what I originally thought.
There's not much you can do apart from reinstalling the NVidia drivers with a newer (or at least different) version; that's the most likely cause.
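If you want to confirm that before reinstalling, you could log the largest slab caches while a training job is running and check whether the same ones keep growing; something along these lines (reading /proc/slabinfo may need root, as you found):

while true; do
  date >> slab_log.txt
  awk '{ print $4 * $14, $1 }' /proc/slabinfo | sort -nr | head -5 >> slab_log.txt
  sleep 600
done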
Dan

Ruoho Ruotsi

Mar 18, 2016, 3:06:45 PM
to kaldi-help, thanhl...@gmail.com, dpo...@gmail.com
Hi Luan,
here are the debugging ideas that come to mind: 

  • Rule out your NVIDIA driver being a factor. nvidia-smi reports your driver is 352.39 (i.e. from last summer, so pretty old). You can look at newer drivers or the latest release (http://www.nvidia.com/Download/index.aspx?lang=en-us); I've found my version, 352.79, to be good enough with CUDA 7.5.
  • For memory-usage profiling that you can actually make sense of, even for a complex project like Kaldi, I'd recommend a tool like the NVIDIA Visual Profiler: https://developer.nvidia.com/nvidia-visual-profiler
  • Rule out Kaldi. This means trying to reproduce the issue (via large memory reads/writes, etc.) from a simple CUDA program. For example, modifying one of the NVIDIA_CUDA-7.5_Samples will be easier to understand/debug, and you will quickly learn your card's performance limits. If you can repro the issue in a small stand-alone program, you can also test out any fixes much more quickly.
  • Rule out exclusive vs. shared mode. Do you know who else is submitting jobs? Is the server accepting asynchronous requests? Can you guarantee you're the only one working on it? (A couple of commands for checking and changing this are after this list.)
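For the driver and exclusive-mode points, these are the kinds of commands I'd start with (the mode change needs root; adjust the -i index to pick a GPU):

# show the installed driver version and the compute mode of each GPU
nvidia-smi -q | grep -E 'Driver Version|Compute Mode'

# put GPU 0 into exclusive-process mode
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# see which processes currently have the NVIDIA devices open
fuser -v /dev/nvidia*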

Daniel Povey

Mar 18, 2016, 3:07:58 PM
to Ruoho Ruotsi, kaldi-help, Luan Pham
All userspace programs (which includes Kaldi) are already ruled out by the diagnostics he ran.  That memory is owned in kernel space -> must be driver (likely) or kernel (unlikely).
Dan

Luan Pham

Mar 18, 2016, 3:11:13 PM
to Daniel Povey, Ruoho Ruotsi, kaldi-help
I agree with both of you that it's likely the driver is the problem. I will install a new driver next Monday and let you know. 

Thank you for your help,
--
Luan Pham
