Error while doing DNN training


Shreya Singhal

Oct 6, 2023, 3:29:30 AM
to kaldi-help
While training a chain model for my Bengali dataset I am getting plagued by instability issues (errors like "Tridiagonalizing matrix that is too large or has NaNs" and "Cholesky decomposition failed. Maybe matrix is not positive definite"). Has anyone faced such issues? According to Google Groups, reducing the learning rate or changing the model topology could help.

Daniel Povey

Oct 6, 2023, 3:46:03 AM
to kaldi...@googlegroups.com
Usually that is how instability will first show itself. If you are using a fairly normal recipe (e.g. TDNN) it is odd that you would get instability.
There might be something wrong with the data in that case (especially if you have successfully trained systems before).
But reducing the learning rate slightly may help.


Shreya Singhal

Oct 6, 2023, 4:01:23 AM
to kaldi-help
I am getting these errors in the initial iterations themselves. Could it be instability or something else?

Daniel Povey

Oct 6, 2023, 4:26:39 AM
to kaldi...@googlegroups.com
Could be an issue with the BLAS installation; try running "make test" in src/matrix/, it may fail there.
Next time, show some pasted output.


Shreya Singhal

Oct 6, 2023, 4:50:09 AM
to kaldi-help
~/kaldi/src/matrix$ make test
Running matrix-lib-test ... 2s... SUCCESS matrix-lib-test
Running sparse-matrix-test ... 0s... SUCCESS sparse-matrix-test
Running numpy-array-test ... 0s... SUCCESS numpy-array-test

I tried setting use_gpu to no, and the chain training starts even with a much higher learning rate.
I reckon it's an issue with GPU usage.

Shreya Singhal

Oct 9, 2023, 8:21:37 AM
to kaldi-help
I tried running make test and got the following output:

Running cu-vector-test .../bin/bash: line 1: 13291 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 3s... FAIL cu-vector-test
Running cu-matrix-test .../bin/bash: line 1: 13303 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 6s... FAIL cu-matrix-test
Running cu-math-test ... 8s... SUCCESS cu-math-test
Running cu-test ... 2s... SUCCESS cu-test
Running cu-sp-matrix-test ... 2s... SUCCESS cu-sp-matrix-test
Running cu-packed-matrix-test ... 2s... SUCCESS cu-packed-matrix-test
Running cu-tp-matrix-test ... 3s... SUCCESS cu-tp-matrix-test
Running cu-block-matrix-test ... 1s... SUCCESS cu-block-matrix-test
Running cu-matrix-speed-test .../bin/bash: line 1: 13368 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 26s... FAIL cu-matrix-speed-test
Running cu-vector-speed-test ... 10s... SUCCESS cu-vector-speed-test
Running cu-sp-matrix-speed-test ... 1s... SUCCESS cu-sp-matrix-speed-test
Running cu-array-test ... 1s... SUCCESS cu-array-test
Running cu-sparse-matrix-test .../bin/bash: line 1: 13411 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 2s... FAIL cu-sparse-matrix-test
Running cu-device-test ... 5s... SUCCESS cu-device-test
Running cu-rand-speed-test ... 1s... SUCCESS cu-rand-speed-test
Running cu-compressed-matrix-test ... 1s... SUCCESS cu-compressed-matrix-test
make: *** [../makefiles/default_rules.mk:104: test] Error 1
 
After this I tried running ./cu-array-test and got the following output:

LOG ([5.5.1074~1-71f3]:SelectGpuId():cu-device.cc:168) Manually selected to compute on CPU.
LOG ([5.5.1074~1-71f3]:main():cu-array-test.cc:136) Tests without GPU use succeeded.
LOG ([5.5.1074~1-71f3]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
LOG ([5.5.1074~1-71f3]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA A40  free:45099M, used:316M, total:45416M, free/total:0.993027 version 8.6
LOG ([5.5.1074~1-71f3]:main():cu-array-test.cc:138) Tests with GPU use (if available) succeeded.
LOG ([5.5.1074~1-71f3]:PrintProfile():cu-device.cc:563) -----
[cudevice profile]
CuArray::SetZero        0.000510931s
CopyFromArray   0.000697374s
CopyFromVec     0.00251794s
CuArray::CopyToVecD2H   0.00295353s
Set     0.0189197s
CuArray::Resize 0.0281687s
Total GPU time: 0.0537682s (may involve some double-counting)
-----
LOG ([5.5.1074~1-71f3]:PrintMemoryUsage():cu-allocator.cc:340) Memory usage: 0/23645388800 bytes currently allocated/total-held; 0/1 blocks currently allocated/free; largest free/allocated block sizes are 0/23645388800; time taken total/cudaMalloc is 0.0278888/0.0276079, synchronized the GPU 0 times out of 190 frees; device memory info: free:22549M, used:22866M, total:45416M, free/total:0.496507maximum allocated: 1024current allocated: 0

Shreya Singhal

Oct 16, 2023, 3:33:46 AM
to kaldi-help
I got errors in these 4 tests:
FAIL cu-vector-test
FAIL cu-matrix-test 
FAIL cu-matrix-speed-test
FAIL cu-sparse-matrix-test 

I got this after running ./cu-vector-test:

```
./cu-vector-test
LOG (cu-vector-test[5.5.1074~1-71f3]:SelectGpuId():cu-device.cc:168) Manually selected to compute on CPU.
-1.05384e+09 -1.05384e+09 -2.15126e+08 -2.15126e+08
LOG (cu-vector-test[5.5.1074~1-71f3]:main():cu-vector-test.cc:868) Tests without GPU use succeeded.
LOG (cu-vector-test[5.5.1074~1-71f3]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
LOG (cu-vector-test[5.5.1074~1-71f3]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA A40 free:45099M, used:316M, total:45416M, free/total:0.993027 version 8.6
1.35243e+08 1.35243e+08
ASSERTION_FAILED (cu-vector-test[5.5.1074~1-71f3]:AssertEqual():matrix/kaldi-vector.h:584) Assertion failed: (a.ApproxEqual(b, tol))

[ Stack-Trace: ]
/fra/clusterdev/centos/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x98a) [0x7fdf7a075c4e]
/fra/clusterdev/centos/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x5b) [0x7fdf7a076770]
./cu-vector-test(void kaldi::CuVectorUnitTestAddDiagMat2<float>()+0x265) [0x412466]
./cu-vector-test(void kaldi::CuVectorUnitTest<float>()+0x115e) [0x4197ac]
./cu-vector-test(main+0x292) [0x40b0dc]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fdf41fdf555]
./cu-vector-test() [0x40ad39]
Aborted (core dumped)
```

CUDA version: 12.2.0
GPU compute capability: 8.6
GPU: NVIDIA A40

What could be the possible reason for these errors?
How can we debug them?

Shreya Singhal

Oct 20, 2023, 2:11:47 AM
to kaldi-help
Can you help me debug the GPU issue? Could a reinstallation help?

Daniel Povey

Oct 20, 2023, 7:39:01 AM
to kaldi...@googlegroups.com
You could try reinstalling Kaldi.
If that doesn't work, installing a different CUDA toolkit version and then reinstalling Kaldi may help.

Daniel Povey

Oct 20, 2023, 7:40:23 AM
to kaldi...@googlegroups.com
You could also try putting some print statements in the failed tests, e.g.
  std::cout << mat1;
  std::cout << mat2;
if it is comparing mat1 vs mat2, for example, to see how different they are: is it a question of the tolerance being too small, or a completely wrong answer?
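For example, a minimal sketch of that idea (mat1 and mat2 are just the placeholder names from this message, not the actual variables of any particular test):

```
// Hedged sketch: print both operands, then probe a couple of tolerances to
// see whether the values are only slightly off (a tolerance problem) or
// completely different (a wrong kernel result).
std::cout << mat1;
std::cout << mat2;
std::cout << "approx-equal at tol 1e-5: " << mat1.ApproxEqual(mat2, 1e-5) << "\n";
std::cout << "approx-equal at tol 1e-2: " << mat1.ApproxEqual(mat2, 1e-2) << "\n";
```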


Shreya Singhal

Oct 23, 2023, 4:20:49 AM
to kaldi-help
(base) [centos@localhost cudamatrix]$ ldd ./cu-matrix-test

        linux-vdso.so.1 =>  (0x00007ffd1bd76000)
        libkaldi-cudamatrix.so => /fra/clusterdev/centos/kaldi/src/lib/libkaldi-cudamatrix.so (0x00007f5b47002000)
        libkaldi-util.so => /fra/clusterdev/centos/kaldi/src/lib/libkaldi-util.so (0x00007f5b46dc9000)
        libkaldi-matrix.so => /fra/clusterdev/centos/kaldi/src/lib/libkaldi-matrix.so (0x00007f5b46b22000)
        libkaldi-base.so => /fra/clusterdev/centos/kaldi/src/lib/libkaldi-base.so (0x00007f5b46906000)
        libfst.so.16 => /fra/clusterdev/centos/kaldi/tools/openfst-1.7.2/lib/libfst.so.16 (0x00007f5b46576000)
        libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f5b45a0a000)
        libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007f5b416ea000)
        libmkl_sequential.so => /opt/intel/mkl/lib/intel64/libmkl_sequential.so (0x00007f5b400d2000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f5b3fece000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5b3fcb2000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f5b3f9b0000)
        libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f5b3dd53000)
        libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f5b3758f000)
        libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007f5b277ef000)
        libcusolver.so.11 => /usr/local/cuda/lib64/libcusolver.so.11 (0x00007f5b207be000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f5b20516000)
        libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f5b1a080000)
        libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007f5b0f350000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f5b0f146000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f5b0ee3e000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5b0ec28000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f5b0e85a000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f5b48c14000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f5b0e652000)
        libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007f5aeb705000)
        libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x00007f5ae853d000)


Could it be an issue with MKL?
Should I instead use https://catalog.ngc.nvidia.com/orgs/nvidia/containers/kaldi?

Shreya Singhal

Oct 23, 2023, 4:35:58 AM
to kaldi-help
Also, this machine runs CentOS instead of Ubuntu.

Daniel Povey

Oct 23, 2023, 5:50:41 AM
to kaldi...@googlegroups.com
I think the issue is that there is older CUDA code in the cudamatrix library that does not work correctly on the Volta or Ampere architectures, because you now need to explicitly synchronize within a warp (previously, the synchronization was implicit).
cu-kernels.cu line 1211:

  // Warp reduce to 1 element. Threads implicitly synchronized within a warp.
  if (tid < warpSize) {
#   pragma unroll
    for (int shift = warpSize; shift > 0; shift >>= 1) {
      ssum[tid] += ssum[tid + shift];
    }
  }


So basically we'd now need to add, on the line below the "for" statement:

   __syncwarp();

but this would have to be done in many other kernels.

Perhaps you can help with this?
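For reference, a minimal self-contained sketch (not the actual Kaldi kernel; the 256-thread block and the buffer name ssum are illustrative) of where such explicit __syncwarp() calls go in the final warp-level stage of a shared-memory reduction:

```
// Toy block-sum reduction showing explicit warp synchronization on Volta
// and later GPUs.  Launch with 256 threads per block; each block reduces
// 256 consecutive floats from 'in' into one element of 'out'.
__global__ void block_sum_reduce(const float *in, float *out) {
  __shared__ float ssum[256];
  const int tid = threadIdx.x;
  ssum[tid] = in[blockIdx.x * blockDim.x + tid];
  __syncthreads();

  // Tree-reduce down to 2 * warpSize partial sums.
  for (int shift = blockDim.x / 2; shift > warpSize; shift >>= 1) {
    if (tid < shift) ssum[tid] += ssum[tid + shift];
    __syncthreads();
  }

  // Final warp-level stage: reads and writes within the warp are no longer
  // implicitly synchronized, so each round needs explicit __syncwarp().
  if (tid < warpSize) {
    for (int shift = warpSize; shift > 0; shift >>= 1) {
      float val = ssum[tid] + ssum[tid + shift];
      __syncwarp();   // all reads of this round are done
      ssum[tid] = val;
      __syncwarp();   // all writes are visible before the next round
    }
  }
  if (tid == 0) out[blockIdx.x] = ssum[0];
}
```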



Shreya Singhal

Oct 23, 2023, 8:26:36 AM
to kaldi-help
I tried adding this line where you pointed, but nothing changed. I don't understand "but this would have to be done on many other kernels". This is what I did:




// Warp reduce to 1 element. Threads implicitly synchronized within a warp.
  if (tid < warpSize) {
#   pragma unroll
    for (int shift = warpSize; shift > 0; shift >>= 1) {
      __syncwarp();
      ssum[tid] += ssum[tid + shift];
    }
  }

Daniel Povey

Oct 24, 2023, 11:32:55 AM
to kaldi...@googlegroups.com
You'd first want to see that it reduced the number of errors in the tests.
Obviously you'd also have to recompile ("make").
There may be required synchronization statements like this in other kernels. You can find them by seeing which tests fail and tracing the code inside the test to see what kernel it eventually invokes.

Kumar

Oct 30, 2023, 12:17:19 AM
to kaldi-help
I am also facing the same issue, with the same NVIDIA driver versions.

I added __syncwarp(); in cu-kernels.cu:

// Warp reduce to 1 element. Threads implicitly synchronized within a warp.
  if (tid < warpSize) {
#   pragma unroll
    for (int shift = warpSize; shift > 0; shift >>= 1) {
      __syncwarp();
      ssum[tid] += ssum[tid + shift];
    }
  }

After recompiling, the same tests failed again, i.e., cu-vector-test, cu-matrix-test, cu-matrix-speed-test, and cu-sparse-matrix-test.

Any further thoughts on how to solve this? (Training runs normally on the CPU.)

Kumar

Oct 30, 2023, 12:34:43 AM
to kaldi-help
Attaching the test logs for reference:
cu-matrix-test_log.txt
cu-vector-test_log.txt
cu-matrix-speed-test_log.txt
cu-sparse-matrix-test_log.txt

Daniel Povey

Oct 30, 2023, 2:56:43 AM
to kaldi...@googlegroups.com
That line was just an example.
Was that you also? Notice that I requested more information about the cu-sparse-matrix-test failure in that thread; I didn't get a response.


Kumar

Oct 30, 2023, 3:33:14 AM
to kaldi-help
I read that thread too and did the same testing. First I ran 'make test' in the cudamatrix directory, and these 4 tests failed: cu-vector-test, cu-matrix-test, cu-matrix-speed-test, cu-sparse-matrix-test.
------------------- make test in cudamatrix log---------------
user@user-All-Series:/mnt/sd1/kaldi/src/cudamatrix$ make test
Running cu-vector-test .../bin/bash: line 2: 3207724 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 1s... FAIL cu-vector-test
Running cu-matrix-test .../bin/bash: line 2: 3207748 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 4s... FAIL cu-matrix-test
Running cu-math-test ... 10s... SUCCESS cu-math-test

Running cu-test ... 2s... SUCCESS cu-test
Running cu-sp-matrix-test ... 2s... SUCCESS cu-sp-matrix-test
Running cu-packed-matrix-test ... 1s... SUCCESS cu-packed-matrix-test
Running cu-tp-matrix-test ... 2s... SUCCESS cu-tp-matrix-test
Running cu-block-matrix-test ... 2s... SUCCESS cu-block-matrix-test
Running cu-matrix-speed-test .../bin/bash: line 2: 3207971 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 27s... FAIL cu-matrix-speed-test
Running cu-vector-speed-test ... 9s... SUCCESS cu-vector-speed-test

Running cu-sp-matrix-speed-test ... 1s... SUCCESS cu-sp-matrix-speed-test
Running cu-array-test ... 0s... SUCCESS cu-array-test
Running cu-sparse-matrix-test .../bin/bash: line 2: 3208286 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 0s... FAIL cu-sparse-matrix-test

Running cu-device-test ... 5s... SUCCESS cu-device-test
Running cu-rand-speed-test ... 1s... SUCCESS cu-rand-speed-test
Running cu-compressed-matrix-test ... 1s... SUCCESS cu-compressed-matrix-test
make: *** [../makefiles/default_rules.mk:104: test] Error 1

The 4 tests failed; I did separate testing as per the instructions, and the logs are attached to my previous message.
In that thread you asked that person to re-run cu-sparse-matrix-test after setting CUDA_LAUNCH_BLOCKING=1; the following is the test log for the same:
-------------------
user@user-All-Series:/mnt/sd1/kaldi/src/cudamatrix$ export CUDA_LAUNCH_BLOCKING=1
user@user-All-Series:/mnt/sd1/kaldi/src/cudamatrix$ ./cu-sparse-matrix-test
LOG ([5.5.1076~1-1b07b5]:SelectGpuId():cu-device.cc:168) Manually selected to compute on CPU.
LOG ([5.5.1076~1-1b07b5]:main():cu-sparse-matrix-test.cc:309) Tests without GPU use succeeded.
LOG ([5.5.1076~1-1b07b5]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
LOG ([5.5.1076~1-1b07b5]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA GeForce GTX 1070   free:7967M, used:144M, total:8112M, free/total:0.982187 version 6.1
ASSERTION_FAILED ([5.5.1076~1-1b07b5]:AssertEqual():base/kaldi-math.h:279) Assertion failed: (ApproxEqual(a, b, relative_tolerance))

[ Stack-Trace: ]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x70c) [0x7f9243cba1ce]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x72) [0x7f9243cbab59]
./cu-sparse-matrix-test(+0x6cc1) [0x55d604994cc1]
./cu-sparse-matrix-test(void kaldi::CudaSparseMatrixUnitTest<float>()+0x35) [0x55d604998a42]
./cu-sparse-matrix-test(main+0xab) [0x55d604994154]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9241429d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f9241429e40]
./cu-sparse-matrix-test(_start+0x25) [0x55d604993fe5]

Aborted (core dumped)

Daniel Povey

Oct 30, 2023, 5:04:43 AM
to kaldi...@googlegroups.com
Hm, OK.
If you configure with --debug-level=2, this will give us the names of the specific tests that failed in the assertion message.
Can you do that? It will indicate which kernels need to be fixed.
The debugging process involves going into the functions that failed (with CUDA_LAUNCH_BLOCKING=1) and figuring out exactly what kernels were called. It may even be possible to run the failed tests in gdb to find the exact code path, e.g.
gdb ./cu-matrix-speed-test
(gdb) catch throw
(gdb) r


Kumar

Oct 30, 2023, 8:04:33 AM
to kaldi-help
Here is the debugging of cu-sparse-matrix-test via gdb:
-----------------------------------------------------------------
user@user-All-Series:/mnt/sd1/kaldi/src/cudamatrix$ export CUDA_LAUNCH_BLOCKING=1
user@user-All-Series:/mnt/sd1/kaldi/src/cudamatrix$ gdb ./cu-sparse-matrix-test
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./cu-sparse-matrix-test...
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) r
Starting program: /mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

LOG ([5.5.1076~1-1b07b5]:SelectGpuId():cu-device.cc:168) Manually selected to compute on CPU.
LOG ([5.5.1076~1-1b07b5]:main():cu-sparse-matrix-test.cc:309) Tests without GPU use succeeded.
[New Thread 0x7fff9915e000 (LWP 3292637)]
[New Thread 0x7fff9895d000 (LWP 3292638)]
[New Thread 0x7fff93d5e000 (LWP 3292639)]

LOG ([5.5.1076~1-1b07b5]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
LOG ([5.5.1076~1-1b07b5]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA GeForce GTX 1070   free:7967M, used:144M, total:8112M, free/total:0.982187 version 6.1
ASSERTION_FAILED ([5.5.1076~1-1b07b5]:AssertEqual():base/kaldi-math.h:279) Assertion failed: (ApproxEqual(a, b, relative_tolerance))

[ Stack-Trace: ]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x70c) [0x7ffff7f0a1ce]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x72) [0x7ffff7f0ab59]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(+0x6cc1) [0x55555555acc1]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(void kaldi::CudaSparseMatrixUnitTest<float>()+0x35) [0x55555555ea42]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(main+0xab) [0x55555555a154]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7ffff5829d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7ffff5829e40]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(_start+0x25) [0x555555559fe5]


Thread 1 "cu-sparse-matri" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352499200) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb)


Daniel Povey

Oct 30, 2023, 10:57:11 PM
to kaldi...@googlegroups.com
You have to show the backtrace, e.g. type "bt" at the gdb prompt after the failure.

Kumar

Oct 30, 2023, 11:07:38 PM
to kaldi-help
Here is the bt after the failure. I am actually not familiar with C++; I can do the instructed steps, but I couldn't understand the failures.
----------------------------------------------------------
[New Thread 0x7fff9915e000 (LWP 4062589)]
[New Thread 0x7fff9895d000 (LWP 4062590)]
[New Thread 0x7fff93d5e000 (LWP 4062591)]

LOG ([5.5.1076~1-1b07b5]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
LOG ([5.5.1076~1-1b07b5]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA GeForce GTX 1070   free:7967M, used:144M, total:8112M, free/total:0.982187 version 6.1
ASSERTION_FAILED ([5.5.1076~1-1b07b5]:AssertEqual():base/kaldi-math.h:279) Assertion failed: (ApproxEqual(a, b, relative_tolerance))

[ Stack-Trace: ]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x70c) [0x7ffff7f0a1ce]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x72) [0x7ffff7f0ab59]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(+0x6cc1) [0x55555555acc1]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(void kaldi::CudaSparseMatrixUnitTest<float>()+0x35) [0x55555555ea42]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(main+0xab) [0x55555555a154]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7ffff5829d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7ffff5829e40]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(_start+0x25) [0x555555559fe5]


Thread 1 "cu-sparse-matri" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352499200) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352499200) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352499200) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352499200, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff5842476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff58287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7f0ab72 in kaldi::KaldiAssertFailure_ (func=func@entry=0x55555556205b "AssertEqual", file=file@entry=0x555555562046 "../base/kaldi-math.h",
    line=line@entry=279, cond_str=cond_str@entry=0x555555562020 "ApproxEqual(a, b, relative_tolerance)") at kaldi-error.cc:238
#6  0x000055555555acc1 in kaldi::AssertEqual (relative_tolerance=9.99999975e-06, b=<optimized out>, a=-5.5970788) at ../base/kaldi-math.h:279
#7  kaldi::UnitTestCuSparseMatrixTraceMatSmat<float> () at cu-sparse-matrix-test.cc:155
#8  0x000055555555ea42 in kaldi::CudaSparseMatrixUnitTest<float> () at cu-sparse-matrix-test.cc:274
#9  0x000055555555a154 in main () at cu-sparse-matrix-test.cc:296
(gdb)


Daniel Povey

Oct 30, 2023, 11:19:28 PM
to kaldi...@googlegroups.com
OK, so the lines leading up to the test failure are:

    trace1 = TraceMatMat(mat3, mat2, kNoTrans);
    trace2 = TraceMatSmat(mat3, cu_smat2, kNoTrans);
    AssertEqual(trace1, trace2, 0.00001);
So the failure will be in the call to TraceMatSmat with the kNoTrans argument, which means the call to cuda_trace_mat_smat() seems to be the problem; that seems to go to the kernel called _trace_mat_smat(). Try adding __syncwarp() in there at the start of the for loop.
You can use similar logic for the other failures.

Kumar

Oct 31, 2023, 4:50:12 AM
to kaldi-help
I have added __syncwarp() in the for loop of _trace_mat_smat() (starting at line 360) in the cu-kernels.cu file, as follows:

----------------------------------------------------------------------------------
static void _trace_mat_smat(const Real* mat, MatrixDim mat_dim,
                            const int* smat_row_ptr, const int* smat_col_idx,
                            const Real* smat_val, Real* trace_vec) {
  const int i = blockIdx.x * blockDim.y + threadIdx.y; // row idx of smat
  if (i < mat_dim.cols) {
    const int nz_start = smat_row_ptr[i];
    const int nz_end = smat_row_ptr[i + 1];
    for (int nz_id = nz_start + threadIdx.x; nz_id < nz_end; nz_id +=
        warpSize) {
      __syncwarp();
      const int j = smat_col_idx[nz_id]; // col idx of smat
      trace_vec[nz_id] = mat[j * mat_dim.stride + i] * smat_val[nz_id];
    }
  }
}


I then recompiled and tested cu-sparse-matrix-test; it failed again. Following is the gdb backtrace, the same failure as before:
--------------------------------------------------------------------------------
user@user-All-Series:/mnt/sd1/kaldi/src/cudamatrix$ export CUDA_LAUNCH_BLOCKING=1
user@user-All-Series:/mnt/sd1/kaldi/src/cudamatrix$ gdb ./cu-sparse-matrix-test
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./cu-sparse-matrix-test...
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) r
Starting program: /mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
LOG ([5.5.1076~2-1b07b5]:SelectGpuId():cu-device.cc:168) Manually selected to compute on CPU.
LOG ([5.5.1076~2-1b07b5]:main():cu-sparse-matrix-test.cc:309) Tests without GPU use succeeded.
[New Thread 0x7fff98f5e000 (LWP 228795)]
[New Thread 0x7fff93fff000 (LWP 228796)]
[New Thread 0x7fff937fe000 (LWP 228797)]
LOG ([5.5.1076~2-1b07b5]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
LOG ([5.5.1076~2-1b07b5]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA GeForce GTX 1070   free:7967M, used:144M, total:8112M, free/total:0.982187 version 6.1
ASSERTION_FAILED ([5.5.1076~2-1b07b5]:AssertEqual():base/kaldi-math.h:279) Assertion failed: (ApproxEqual(a, b, relative_tolerance))

[ Stack-Trace: ]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(+0x7bde) [0x7ffff7fb2bde]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x2b4) [0x7ffff7fb3414]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(kaldi::MessageLogger::Log::operator=(kaldi::MessageLogger const&)+0x20) [0x55555556c844]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0xbd) [0x7ffff7fb364c]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(+0xc297) [0x555555560297]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(+0x1571e) [0x55555556971e]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(void kaldi::CudaSparseMatrixUnitTest<float>()+0x17) [0x55555556d448]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(main+0x11e) [0x55555556849f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7ffff5629d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7ffff5629e40]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(_start+0x25) [0x5555555600a5]



Thread 1 "cu-sparse-matri" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737322786816) at ./nptl/pthread_kill.c:44

44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737322786816) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737322786816) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737322786816, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff5642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff56287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7fb366a in kaldi::KaldiAssertFailure_ (func=0x55555557107b "AssertEqual", file=0x555555571066 "../base/kaldi-math.h", line=279,
    cond_str=0x555555571040 "ApproxEqual(a, b, relative_tolerance)") at kaldi-error.cc:238
#6  0x0000555555560297 in kaldi::AssertEqual (a=-5.5970788, b=-8.97242928, relative_tolerance=9.99999975e-06) at ../base/kaldi-math.h:279
#7  0x000055555556971e in kaldi::UnitTestCuSparseMatrixTraceMatSmat<float> () at cu-sparse-matrix-test.cc:155
#8  0x000055555556d448 in kaldi::CudaSparseMatrixUnitTest<float> () at cu-sparse-matrix-test.cc:274
#9  0x000055555556849f in main () at cu-sparse-matrix-test.cc:296
(gdb)

Daniel Povey

Oct 31, 2023, 7:35:12 AM
to kaldi...@googlegroups.com
Sorry, you can revert that; that kernel was not doing that kind of logarithmic reduction, just a linear one.
Try this change instead.

  if (tid < warpSize) {
    for (int shift = warpSize; shift > 0; shift >>= 1) {
      // here:  __syncwarp(), cu-kernels.cu:1798
      sdata[tid] = op.Reduce(sdata[tid], sdata[tid + shift]);
    }
  }


Kumar

Oct 31, 2023, 7:52:03 AM
to kaldi-help
I did the same and there is no change in the test log:
----------------------------------------------------------------------
[New Thread 0x7fff9edff000 (LWP 58970)]
[New Thread 0x7fff9d383000 (LWP 58973)]
[New Thread 0x7fff9cb82000 (LWP 58974)]

LOG ([5.5.1076~2-1b07b5]:SelectGpuId():cu-device.cc:238) CUDA setup operating under Compute Exclusive Mode.
LOG ([5.5.1076~2-1b07b5]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA GeForce GTX 1070   free:7967M, used:144M, total:8112M, free/total:0.982187 version 6.1
ASSERTION_FAILED ([5.5.1076~2-1b07b5]:AssertEqual():base/kaldi-math.h:279) Assertion failed: (ApproxEqual(a, b, relative_tolerance))

[ Stack-Trace: ]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x70c) [0x7ffff7f0a1ce]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x72) [0x7ffff7f0ab59]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(+0x6cc1) [0x55555555acc1]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(void kaldi::CudaSparseMatrixUnitTest<float>()+0x35) [0x55555555ea42]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(main+0xab) [0x55555555a154]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7ffff5829d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7ffff5829e40]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(_start+0x25) [0x555555559fe5]


Thread 1 "cu-sparse-matri" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:44

44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352507392, signo=signo@entry=6) at ./nptl/pthread_kill.c:89

#3  0x00007ffff5842476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff58287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7f0ab72 in kaldi::KaldiAssertFailure_ (func=func@entry=0x55555556205b "AssertEqual", file=file@entry=0x555555562046 "../base/kaldi-math.h",
    line=line@entry=279, cond_str=cond_str@entry=0x555555562020 "ApproxEqual(a, b, relative_tolerance)") at kaldi-error.cc:238
#6  0x000055555555acc1 in kaldi::AssertEqual (relative_tolerance=9.99999975e-06, b=<optimized out>, a=-5.5970788) at ../base/kaldi-math.h:279
#7  kaldi::UnitTestCuSparseMatrixTraceMatSmat<float> () at cu-sparse-matrix-test.cc:155
#8  0x000055555555ea42 in kaldi::CudaSparseMatrixUnitTest<float> () at cu-sparse-matrix-test.cc:274
#9  0x000055555555a154 in main () at cu-sparse-matrix-test.cc:296
(gdb)


Daniel Povey

Nov 1, 2023, 12:37:52 AM
to kaldi...@googlegroups.com
Try applying this diff to cu-kernels.cu.
Note: if you are in cudamatrix/, you could do it with:
patch -p3 <this_diff_file
assuming you pasted the part after "diff -b" into a file.
If that doesn't work, do it manually.
Sorry, my CUDA is rusty.

git diff -b

diff --git a/src/cudamatrix/cu-kernels.cu b/src/cudamatrix/cu-kernels.cu
index 8044ff699..78def14d9 100644
--- a/src/cudamatrix/cu-kernels.cu
+++ b/src/cudamatrix/cu-kernels.cu
@@ -2087,13 +2087,13 @@ static void _group_transform_reduce(
       x_idx += threads_per_group;
     }
     sreduction[tid] = treduction;
-    if (threads_per_group > warpSize) {
+
     __syncthreads();
-    }
 
     // tree-reduce to 2x warpSize elements per group
 #   pragma unroll
-    for (int shift = threads_per_group / 2; shift > warpSize; shift >>= 1) {
+    int shift = threads_per_group / 2;
+    for (; shift > warpSize; shift >>= 1) {
       if (threadIdx.x < shift) {
         sreduction[tid] = op.Reduce(sreduction[tid], sreduction[tid + shift]);
       }
@@ -2101,14 +2101,12 @@ static void _group_transform_reduce(
     }
 
     // Warp-reduce to 1 element per group.
-    // Threads implicitly synchronized within the warp.
-    const int warp_reduce_size =
-        threads_per_group / 2 < warpSize ? threads_per_group / 2 : warpSize;
-    if (threadIdx.x < warp_reduce_size) {
 #     pragma unroll
-      for (int shift = warp_reduce_size; shift > 0; shift >>= 1) {
+    for (; shift > 0; shift >>= 1) {
+      if (threadIdx.x < shift) {
         sreduction[tid] = op.Reduce(sreduction[tid], sreduction[tid + shift]);
       }
+      __syncwarp();
     }
 
     // Store the result.



Kumar

Nov 1, 2023, 9:41:38 AM
to kaldi-help
I applied this diff, recompiled, and tested; there is no change in the failure log. If this issue is with the CUDA version, then please suggest another CUDA version that is compatible with Kaldi given my GPU and OS below. Sorry, as I am approaching my deadlines (I have to train an ASR system on Kaldi for a course project), I may not be able to keep debugging for much longer. I can try for a few more days if it can be solved soon. Thanks.
OS: Ubuntu 22.04
GPU: NVIDIA GeForce GTX 1070
Current CUDA version: 12.2.2
----------------------------------------------------------------------- log of cu-sparse-matrix-test---------------------------------------------------------
[New Thread 0x7fff98f5e000 (LWP 8820)]
[New Thread 0x7fff93fff000 (LWP 8821)]
[New Thread 0x7fff937fe000 (LWP 8822)]
WARNING ([5.5.1076~2-1b07b5]:SelectGpuId():cu-device.cc:243) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
[Thread 0x7fff937fe000 (LWP 8822) exited]
[Thread 0x7fff93fff000 (LWP 8821) exited]
LOG ([5.5.1076~2-1b07b5]:SelectGpuIdAuto():cu-device.cc:438) Selecting from 1 GPUs
[New Thread 0x7fff93fff000 (LWP 8823)]
[New Thread 0x7fff937fe000 (LWP 8824)]
LOG ([5.5.1076~2-1b07b5]:SelectGpuIdAuto():cu-device.cc:453) cudaSetDevice(0): NVIDIA GeForce GTX 1070  free:8011M, used:100M, total:8112M, free/total:0.987611
[Thread 0x7fff937fe000 (LWP 8824) exited]
[Thread 0x7fff93fff000 (LWP 8823) exited]
LOG ([5.5.1076~2-1b07b5]:SelectGpuIdAuto():cu-device.cc:501) Device: 0, mem_ratio: 0.987611
LOG ([5.5.1076~2-1b07b5]:SelectGpuId():cu-device.cc:382) Trying to select device: 0
[New Thread 0x7fff93fff000 (LWP 8825)]
[New Thread 0x7fff937fe000 (LWP 8826)]
LOG ([5.5.1076~2-1b07b5]:SelectGpuIdAuto():cu-device.cc:511) Success selecting device 0 free mem ratio: 0.987611

LOG ([5.5.1076~2-1b07b5]:FinalizeActiveGpu():cu-device.cc:338) The active GPU is [0]: NVIDIA GeForce GTX 1070   free:7967M, used:144M, total:8112M, free/total:0.982187 version 6.1
ASSERTION_FAILED ([5.5.1076~2-1b07b5]:AssertEqual():base/kaldi-math.h:279) Assertion failed: (ApproxEqual(a, b, relative_tolerance))

[ Stack-Trace: ]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x70c) [0x7ffff7f0a1ce]
/mnt/sd1/kaldi/src/lib/libkaldi-base.so(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x72) [0x7ffff7f0ab59]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(+0xc297) [0x555555560297]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(+0x1571e) [0x55555556971e]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(void kaldi::CudaSparseMatrixUnitTest<float>()+0x17) [0x55555556d448]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(main+0x11e) [0x55555556849f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7ffff5629d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7ffff5629e40]
/mnt/sd1/kaldi/src/cudamatrix/cu-sparse-matrix-test(_start+0x25) [0x5555555600a5]


Thread 1 "cu-sparse-matri" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352507392, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff5642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff56287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7f0ab72 in kaldi::KaldiAssertFailure_ (func=<optimized out>, file=<optimized out>, line=<optimized out>,

    cond_str=0x555555571040 "ApproxEqual(a, b, relative_tolerance)") at kaldi-error.cc:238
#6  0x0000555555560297 in kaldi::AssertEqual (a=-5.5970788, b=-8.97242928, relative_tolerance=9.99999975e-06) at ../base/kaldi-math.h:279
#7  0x000055555556971e in kaldi::UnitTestCuSparseMatrixTraceMatSmat<float> () at cu-sparse-matrix-test.cc:155
#8  0x000055555556d448 in kaldi::CudaSparseMatrixUnitTest<float> () at cu-sparse-matrix-test.cc:274
#9  0x000055555556849f in main () at cu-sparse-matrix-test.cc:296
(gdb)


Daniel Povey

Nov 1, 2023, 12:14:51 PM
to kaldi...@googlegroups.com
Your GPU is Pascal-based, not Volta, so I'm surprised it would get new problems like this.
You could try downgrading the CUDA toolkit, e.g. to a version 11.x, or to 12.1 or 12.0.
These problems only started happening very recently so it could be some change in their latest toolkit version.

Daniel Povey

Nov 1, 2023, 12:48:11 PM
to kaldi...@googlegroups.com
There is also a possibility that the test failure in cu-sparse-matrix-test.cc was just a small roundoff issue where the difference accidentally ended up just above the relative_tolerance specified in the test. Printing the 2 values compared would easily verify this.
Similarly with the other tests. Maybe some of the problems are real problems and some are not.
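For instance, a minimal sketch of that check, spliced into the snippet quoted earlier from cu-sparse-matrix-test.cc (variable names are taken from that quote):

```
// Print both traces and their difference just before the assertion, so a
// borderline roundoff can be told apart from a completely wrong result.
trace1 = TraceMatMat(mat3, mat2, kNoTrans);
trace2 = TraceMatSmat(mat3, cu_smat2, kNoTrans);
KALDI_LOG << "trace1 = " << trace1 << ", trace2 = " << trace2
          << ", abs diff = " << std::abs(trace1 - trace2);
AssertEqual(trace1, trace2, 0.00001);
```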

Daniel Povey

Nov 3, 2023, 7:07:15 AM
to kaldi...@googlegroups.com
You can apply this fix; I will merge it soon.

Kumar

Nov 3, 2023, 8:51:25 AM
to kaldi-help
This fix is working. Now I am able to train on the GPU. Thanks very much for the quick support.