CUDA questions


RPC Portland

Jan 12, 2016, 4:57:37 PM
to kaldi-help
Hello all,

You've been very helpful in previous posts, and I'm much further along now than I was previously. I've secured help from a speech recognition consultant to assist in getting Kaldi running. At this point, I'm still having trouble with my GPU. I have a Tesla S1070, which has four C1060s based on the T10 processor, with compute capability 1.3 (I'm hoping that's not too old). I have installed CUDA toolkit 6.5 with the 340.xx drivers, which are the latest that support this GPU. I was told to use CUDA 6.5, which is also the latest that supports this GPU. I re-compiled Kaldi once this version of CUDA and the drivers were installed.

So here's the problem. I had previously succeeded in running run.sh in the tedlium recipe without a problem (but with a different version of toolkit and drivers installed), but now I'm getting this error when running run_nnet2_ms_perturbed.sh (from exp/nnet2_online/nnet_ms_sp/log/train.0.2.log):

nnet-copy-egs --frame=0 ark:exp/nnet2_online/nnet_ms_sp/egs/egs.2.ark ark:-
ERROR (nnet-train-simple:CopyRows():cu-matrix.cc:2138) cudaError_t 8 : "invalid device function " returned from 'cudaGetLastError()'
ERROR (nnet-train-simple:AddDiagMatMat():cu-vector.cc:580) cudaError_t 8 : "invalid device function " returned from 'cudaGetLastError()'

This happens after it appears to successfully create one model:

LOG (nnet-train-transitions:main():nnet-train-transitions.cc:140) Trained transitions of neural network model and wrote it to exp/nnet2_online/nnet_ms_sp/0.mdl
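
For reference, cudaError_t 8 ("invalid device function") generally means that the binary contains no kernel image compiled for the device's compute capability, so the kernel launch itself fails. A minimal, generic sketch of the check pattern that surfaces this error (a standalone .cu file for illustration, not Kaldi's actual error-handling code):

  #include <cstdio>
  #include <cuda_runtime.h>

  // Trivial kernel; if the binary was not built with a -gencode entry
  // matching this GPU (e.g. compute capability 1.3), launching it fails
  // with cudaErrorInvalidDeviceFunction (error code 8).
  __global__ void fill_kernel(float *data, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
  }

  int main() {
    const int n = 1024;
    float *d = NULL;
    cudaMalloc((void**)&d, n * sizeof(float));

    fill_kernel<<<(n + 255) / 256, 256>>>(d, n, 1.0f);

    // The usual post-launch check; this is where "invalid device function"
    // shows up when no matching kernel image exists for the GPU.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
      std::printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return 0;
  }

Compiling such a file with and without a matching -gencode entry (e.g. arch=compute_13,code=sm_13) is one way to separate a toolkit/driver problem from a Kaldi build problem.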

So I ran 'make test' under src/cudamatrix

Running cu-vector-test .../bin/sh: line 1: 27150 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-vector-test
Running cu-matrix-test .../bin/sh: line 1: 27161 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-matrix-test
Running cu-math-test .../bin/sh: line 1: 27171 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-math-test
Running cu-test .../bin/sh: line 1: 27180 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-test
Running cu-sp-matrix-test .../bin/sh: line 1: 27189 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-sp-matrix-test
Running cu-packed-matrix-test .../bin/sh: line 1: 27198 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-packed-matrix-test
Running cu-tp-matrix-test .../bin/sh: line 1: 27207 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-tp-matrix-test
Running cu-block-matrix-test .../bin/sh: line 1: 27216 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-block-matrix-test
Running cu-matrix-speed-test .../bin/sh: line 1: 27225 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-matrix-speed-test
Running cu-vector-speed-test .../bin/sh: line 1: 27235 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-vector-speed-test
Running cu-sp-matrix-speed-test .../bin/sh: line 1: 27244 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-sp-matrix-speed-test
Running cu-array-test .../bin/sh: line 1: 27253 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-array-test
Running cu-sparse-matrix-test .../bin/sh: line 1: 27262 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-sparse-matrix-test
Running cu-device-test ...... SUCCESS
make: *** [test] Error 1


So I'm not quite sure what's wrong, but I'm worried that CUDA 6.5 doesn't fully support my GPUs, as they are older. I was told kaldi needed 6.5, but the makefiles seem to indicate older toolkits will work. Could I run 5.5 or 6.0 with kaldi? They appear to support my GPUs more fully, and the GPUs are plenty powerful for the task at hand if I can get them to work. I'm afraid that if I had to run run_nnet2_ms_perturbed.sh in CPU-only mode, it would take too long (even with quad-socket, quad-core 2.9 GHz CPUs and 64 GB of RAM).

I had read this post on the old forums: http://sourceforge.net/p/kaldi/discussion/1355348/thread/c9991f50/
and tried what was recommended there.

What would you recommend I do or try?


Thanks!

Rhiannon

Daniel Povey

Jan 12, 2016, 5:08:01 PM
to kaldi-help
From here it looks like you need to have an option like -gencode arch=compute_35,code=sm_35 to support that architecture, but the Makefile should already do that. Maybe do 'make clean' in cudamatrix/ and make again, and check that that option is being supplied when compiling the CUDA code.
Dan



RPC Portland

Jan 12, 2016, 7:54:25 PM
to kaldi-help, dpo...@gmail.com
Dan,

I had looked over that portion of the Makefile many times, trying to be sure it's not telling it to build for anything over 1.3. I haven't modified the file to see what ends up in that variable when the tests start. The way it's written in the Makefile is confusing, adding higher compute capabilities for lower CUDA versions, though it shows it would pass 1.0 and 1.3 for CUDA 6.5. I was thinking of hard-coding the architecture type.

However, it hadn't occurred to me to run make clean - I clean forgot to do that (sorry, but I had to). I'll try that now, and read the link you passed on.

Thanks!

Rhiannon

Daniel Povey

Jan 12, 2016, 8:29:57 PM
to RPC Portland, kaldi-help
It adds a bunch of -gencode options just in case.  Probably no harm to have it add all of those it wants to add, i.e. use the Makefile unmodified.  According to https://en.wikipedia.org/wiki/CUDA, the K10 has compute capability 3.0, which may explain the problem if you have removed later ones from the Makefile.
Dan

RPC Portland

Jan 14, 2016, 4:17:57 PM
to kaldi-help, rhiann...@risingcables.com, dpo...@gmail.com
I'm not running the K10, I'm running the Tesla S1070 (4x C1060s), which is based on the T10 and has compute capability 1.3. Using the Makefile unmodified, it fails on every test, but by adding this line

CUDA_ARCH=-gencode arch=compute_13,code=sm_13

after all the CUDA endifs, I can get most of them to succeed:


Running cu-vector-test .../bin/sh: line 1:  5113 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-vector-test
Running cu-matrix-test .../bin/sh: line 1:  5122 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
... FAIL cu-matrix-test
Running cu-math-test ...... SUCCESS
Running cu-test ...... SUCCESS
Running cu-sp-matrix-test ...... SUCCESS
Running cu-packed-matrix-test ...... SUCCESS
Running cu-tp-matrix-test ...... SUCCESS
Running cu-block-matrix-test .../bin/sh: line 1:  5173 Aborted                 (core dumped) ./$x > $x.testlog 2>&1

... FAIL cu-block-matrix-test
Running cu-matrix-speed-test ...
... SUCCESS
Running cu-vector-speed-test ...... SUCCESS
Running cu-sp-matrix-speed-test ...... SUCCESS
Running cu-array-test ...... SUCCESS
Running cu-sparse-matrix-test ...... SUCCESS

Running cu-device-test ...... SUCCESS
make: *** [test] Error 1

However, even after doing a make clean, ./configure, make depend, and make in kaldi-trunk/src, then a make clean and make test in src/cudamatrix, running run_nnet2_ms_perturbed.sh still fails with the same error:


ERROR (nnet-train-simple:CopyRows():cu-matrix.cc:2138) cudaError_t 8 : "invalid device function " returned from 'cudaGetLastError()'
ERROR (nnet-train-simple:AddDiagMatMat():cu-vector.cc:580) cudaError_t 8 : "invalid device function " returned from 'cudaGetLastError()'

One thing I've come across on the kaldi forums and elsewhere is the possibility that I have the wrong driver/module for the CUDA toolkit. I'm running CUDA 6.5 with the 340.xx module, which should be compatible, but they were not installed together. Does kaldi run with CUDA 5.5? I'm wondering if I might have better luck with a toolkit/module combination that is installed together. The .deb repository from NVidia that was supposed to install CUDA 6.5 and the 340.xx module kept installing CUDA 7.5, which doesn't work with my card, and another I found for CUDA 6.5 kept installing the 352.xx module, which also doesn't work with my card. So maybe I'd have better luck with a CUDA 5.5 package that actually installs the toolkit and the 340.xx module at the same time.

Whatever is happening, it seems to be sending the wrong calls to the card. I was under the impression that a previous code update fixed the problem of not clearing cudaGetLastError() each time.
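
As background on that last point: cudaGetLastError() both returns and clears the runtime's pending error, so a kernel launch that fails without being checked can be reported later from an unrelated function. A simplified, generic illustration (the oversized launch is just a hypothetical way to provoke a failure):

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void noop() {}

  int main() {
    // Invalid launch configuration (4096 threads per block exceeds the
    // hardware limit); the launch fails, but nothing checks it here.
    noop<<<1, 4096>>>();

    // A later, perfectly valid launch...
    noop<<<1, 32>>>();

    // ...yet the stale error from the first launch is what gets reported,
    // attributed to whichever function happens to do the checking.
    cudaError_t err = cudaGetLastError();  // returns and clears the error
    std::printf("reported here: %s\n", cudaGetErrorString(err));
    return 0;
  }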

Thanks!

Rhiannon

Daniel Povey

Jan 14, 2016, 4:51:14 PM
to RPC Portland, Karel Veselý, kaldi-help
The Makefile, which I think was created by Karel, adds the following lines:
  CUDA_VER_GT_6_5 := $(shell [ $(CUDA_VERSION) -ge 65 ] && echo true)
  ifneq ($(CUDA_VER_GT_6_5), true)
    CUDA_ARCH += -gencode arch=compute_13,code=sm_13 \
                 -gencode arch=compute_10,code=sm_10
  endif
So it is only adding compute capability 1.3 if the toolkit version is older than 6.5.  I imagine there must be a reason for this, but I'm not sure what it is.  I don't think I want to invest a lot of time figuring out what the problem is because it's likely the answer is that it can't be done.  You could try installing an earlier version of the CUDA toolkit and see what happens.
Dan


Vesely Karel

Jan 15, 2016, 10:03:43 AM
to dpo...@gmail.com, RPC Portland, kaldi-help, Jan Trmal
Hi, unfortunately I don't have access to a GPU with compute capability 1.3.

The issue with the Makefile is easy to rectify: we could ask the NVidia guys in which version exactly compute capabilities 1.0 and 1.3 became unsupported, and modify the Makefile accordingly. What is harder to figure out is why the unit tests are failing.

Based on the 'ERROR' lines:
ERROR (nnet-train-simple:CopyRows():cu-matrix.cc:2138)
ERROR (nnet-train-simple:AddDiagMatMat():cu-vector.cc:580)


I am almost sure that the problem is caused by the heuristic that determines the sizes of 'dimBlock' and 'dimGrid', which is done in 'GetBlockSizesForSimpleMatrixOperation'.

The differences between the compute capabilities are as follows:

1.3 (and older):
- max 512 threads per block,
- block x- and y-dimensions limited to 512,
- max grid size 65535 x 65535 (i.e. the number of blocks on the x- and y-axes).

2.0 (and newer):
- max 1024 threads per block,
- block x- and y-dimensions limited to 1024,
- max grid size 65535 x 65535 (raised to (2^31 - 1) x 65535 in compute capability 3.0).

More info is here:
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
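
For illustration, a simplified heuristic of that kind might look like the sketch below. This is not the actual Kaldi implementation; the 16x16 block and the per-axis clamp are assumptions chosen to stay inside the compute-1.3 limits listed above:

  #include <algorithm>
  #include <cuda_runtime.h>

  // Pick a 2-D block and grid for an element-wise operation on a
  // num_rows x num_cols matrix, staying within compute-1.3 limits:
  // at most 512 threads per block and at most 65535 blocks per grid axis.
  void GetSimpleBlockSizes(int num_rows, int num_cols,
                           dim3 *dimGrid, dim3 *dimBlock) {
    const int kMaxThreads = 256;    // 16 x 16, comfortably below 512
    const int kMaxGridDim = 65535;  // per-axis grid limit on 1.3 (and 2.x)

    // Guard against empty matrices so the divisions below stay valid.
    int block_x = std::max(1, std::min(num_cols, 16));
    int block_y = std::max(1, std::min(num_rows, kMaxThreads / block_x));
    *dimBlock = dim3(block_x, block_y, 1);

    int grid_x = (num_cols + block_x - 1) / block_x;
    int grid_y = (num_rows + block_y - 1) / block_y;
    // If either axis overflowed the per-axis limit, the caller would have
    // to tile the operation; here we simply clamp to keep the sketch short.
    *dimGrid = dim3(std::min(grid_x, kMaxGridDim),
                    std::min(grid_y, kMaxGridDim), 1);
  }

If the real heuristic ever produces a block with more than 512 threads, or a grid axis larger than 65535, a compute-1.3 card rejects the launch, which is why this code path is a natural suspect.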

@Rhiannon: I would consider upgrading to a more recent GPU, which will be much faster. The 32-bit float performance is roughly as follows:

S1070: 2500 / 4 = 625 GFLOPS (per GPU in the unit)
GTX970: 3500 GFLOPS (single GPU, ~$300)

@Dan, Yenda: Should we support compute capability 1.3 (i.e. GPUs older than Fermi, which are >5x slower than a GTX970)?
- If YES, we should fix the code (limit the block size to 512, maybe add support for 1024 threads).
- If NO, we should change the code to require a minimum capability of 2.0.

Personally I'd advise the 'NO' case, but if compatibility with older hardware is important we can do the 'YES' case.

What do you guys think?

Best!
Karel.
-- 
Karel Vesely, Brno University of Technology
ives...@fit.vutbr.cz, +420-54114-1300

Daniel Povey

Jan 15, 2016, 1:05:57 PM
to Vesely Karel, RPC Portland, kaldi-help, Jan Trmal
I'd say no, we shouldn't support it; we're all too busy. Anyway, those old GPUs may not work for other reasons (e.g. insufficient memory).

Dan

filmingi...@gmail.com

Jan 15, 2016, 3:56:55 PM
to kaldi-help, ives...@fit.vutbr.cz, rhiann...@risingcables.com, jtr...@gmail.com, dpo...@gmail.com
Well, this entire project may fail if I can't make these GPUs work, unless I'm able to get run_nnet2_ms_perturbed.sh to finish in CPU-only mode (I'm unsure how long that would take, even with 16 CPUs and plenty of RAM). I know for a fact that CUDA 6.5 is the last toolkit to support compute capability 1.3, and I had modded the Makefile to use 1.3 despite the 'older than 6.5' condition that threw it off. The unit has plenty of memory (16 GB total, 4 GB per GPU) and certainly plenty of power, since it ran the main tedlium run.sh in just under a day.

In the absence of any documentation or forum posts I could find stating the minimum recommended GPU requirements, I obtained and configured this setup. We've retained the Cobalt consultants and already paid for consulting time which we cannot use until this completes. And all of this is just so I can achieve a basic proof of concept before more money is made available for newer hardware. It's so frustrating; I'm so close after so much time spent.

I've been in the IT game for over twenty years, and what drives me nuts about technology is how a device or system that's only a few years old and still suits the user's needs just fine (like a smartphone) becomes completely unsupported in just a few years; it makes no sense. I'm not placing blame on you here, but on the technology providers who move forward so fast and abandon previous tech way too soon. I'm not a believer in perfectly good tech being obsolete so soon.

Anyhow, I'm stuck. Are there some mods I can make to the source files to make this work? If it took an hour of someone's time, we'd pay for that. I'm already working over 16 hours a day, 7 days a week, trying to handle everything on my plate, and I've got to get this going quickly without blowing much more money.

I read up on the GTX970, and aside from the big to-do about its misstated specs, I'm not sure it will work in the external PCI-E expansion chassis we have on the Tesla S1070 the way I know some of the Fermi GPUs will (like the C2050/C2070). I can get C2070s for $99 if they would work for training. I doubt I can build a PC that will handle GPU video cards outright due to the cost, and you can't run them in most PowerEdges, so I'd like to re-use this chassis if possible.

So I don't know what to do now. If I can't somehow finagle it to complete with what I have, would it at least run with the C2070s?

Thanks,

Rhiannon

Daniel Povey

Jan 15, 2016, 4:07:08 PM
to Filming In Portland, kaldi-help, Karel Veselý, RPC Portland, Jan Trmal

I'm not sure why it's not working; I doubt it's the limit of 512 threads per block, since we don't use more than 256 anyway. If you could work out, using gdb, where exactly in the various tests it's failing, and with which problem sizes, that would be helpful. You may have to find someone who knows how to use gdb.
Dan

Karel Veselý

Jan 16, 2016, 4:09:26 AM
to dpo...@gmail.com, Filming In Portland, kaldi-help, RPC Portland, Jan Trmal
Hi,
I can look into that and see what the problem really is. I'll ask our admins to prepare a machine with a GTX285, which has the same compute capability, 1.3. Also, I'll be on vacation the whole of next week, so it will take some time.

As Dan said, getting more info by debugging the unit tests would be helpful; the dimensions of the blocks and grids are one of the hottest clues right now.

Thanks,
Karel.


Karel Vesely

Jan 25, 2016, 8:00:27 AM
to filmingi...@gmail.com, kaldi-help, rhiann...@risingcables.com, jtr...@gmail.com, dpo...@gmail.com
Hi, a good comparison of GPUs is here:
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units

The most important factors are 'Single precision processing power' and 'Memory bandwidth'; the larger the number, the faster the runtime. I would avoid buying the old Teslas: a single gaming GPU from the 9xx series will be more powerful than several of the old Teslas. If the new GPU does not work in your chassis, you can put it in a PC with a stronger power supply unit and a PCI-Express 16x slot on the motherboard. Even a single GPU is enough to train a neural network...

Alternatively, you could use some pre-built English models from here:
http://kaldi-asr.org/downloads/all/egs/

Best,
Karel.



Karel Vesely

Jan 29, 2016, 10:04:27 AM
to filmingi...@gmail.com, kaldi-help, rhiann...@risingcables.com, jtr...@gmail.com, dpo...@gmail.com
Hi,
I tried to replicate the issue with similar hardware (a GTX285, with the same compute capability 1.3 as your Teslas). First, because the card is old, we have to use the old CUDA 6.5.14, which also requires using the older compiler gcc 4.8.

Then, the current nvidia-smi from GPU driver version 352.30 could not work with the old GPU, while the GPU driver must be at least 340.00 (the CUDA 6.5.14 package contains driver 340.29).

And installing the compatible version of the driver locally on a single node in our cluster would take more than two hours for our admin, who does an excellent job for us and is busy with many other things, so we stopped at this point.

I did also check whether the kernel launch parameters (dimBlock, dimGrid) are okay, and they seem to be within the limits for compute capability 1.3.
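
One way to cross-check this on the failing machine is to query the device limits at runtime and compare them with the dimBlock/dimGrid a test is about to use. A small standalone diagnostic along these lines (not part of Kaldi; the example dimensions are made up):

  #include <cstdio>
  #include <cuda_runtime.h>

  // Compare a proposed launch configuration against the limits of device 0.
  bool LaunchDimsOk(dim3 grid, dim3 block) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("cc %d.%d  maxThreadsPerBlock=%d  maxBlockDim=(%d,%d,%d)  "
                "maxGridDim=(%d,%d,%d)\n",
                prop.major, prop.minor, prop.maxThreadsPerBlock,
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2],
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return (int)(block.x * block.y * block.z) <= prop.maxThreadsPerBlock &&
           (int)block.x <= prop.maxThreadsDim[0] &&
           (int)block.y <= prop.maxThreadsDim[1] &&
           (int)block.z <= prop.maxThreadsDim[2] &&
           (int)grid.x <= prop.maxGridSize[0] &&
           (int)grid.y <= prop.maxGridSize[1] &&
           (int)grid.z <= prop.maxGridSize[2];
  }

  int main() {
    // Example: dimensions an element-wise matrix kernel might use.
    dim3 block(16, 16, 1), grid(40, 375, 1);
    std::printf("launch dims %s\n", LaunchDimsOk(grid, block) ? "ok" : "NOT ok");
    return 0;
  }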

@Rhiannon: Any update on your side? Did you try gdb (or cuda-gdb)? Are you sure your GPUs are in good shape? (Did you try them with some other software besides kaldi?)

Now I understand what poor support for old hardware means in practice. On the other hand, we should keep the practical merit in mind; GPUs in particular become slow and obsolete quite fast.

Best regards,
Karel.


RPC Portland

Feb 11, 2016, 5:04:20 PM
to kaldi-help
Sorry for the delayed reply; I've been trying all sorts of things to get kaldi working, as well as juggling other projects. At this point, we've run out of the consulting time that was paid for without a working kaldi engine, so we're probably going to have to use third-party APIs for ASR for the time being, as we must get this project online and can't afford more delays. The huge downside is the expense versus decoding in-house. I'm going to continue trying to get kaldi to work in the meantime, which would bring the cost of decoding down by a huge degree. This project could never succeed long-term while outsourcing the ASR.

Some answers -

Yes, the Tesla S1070 required CUDA 6.5 and the 340 drivers; I had those properly installed, and that particular system worked fine on the non-kaldi projects I tested with (older scientific applications). cudamatrix still failed on many of the tests (as posted).

I then acquired a Tesla C2090 for $99 (which actually has a Fermi engine) and installed CUDA 6.5 with the 352 drivers (as I was told CUDA 6.5 was recommended), though the card will work fine with CUDA 7. It supports compute capability 2.0. It passed all the cudamatrix tests fine, but still failed running the nnet2 scripts with the exact same error as the S1070. It also worked with the non-kaldi software I ran as a test, and passed all the diagnostics I ran.

There were no further funds available to just purchase a GTX970 for ~$340 and build another computer to host it (if it doesn't fit in the external GPU enclosure, which I doubt it will, as it's so big). We have paid a significant amount for the consultant to help get kaldi going, our web developer, office rent, etc.
At this point, I've received my personal tax refund and will likely have to use money from that to acquire a GTX970 and possibly build a cheap PC to host it. That money was already earmarked for paying down debt, but I must get this working.

We had already tried the pre-built models you had posted, but they don't work: the tedlium models are missing the graph (and possibly another piece, I need to check) for using nnet2. The Librispeech and Fisher-English models have multiple different versions of their various parts uploaded, and when I tried to use them, they crashed every time. The consultant said they needed to have consistent versions. Otherwise, I would have been more than happy to just use the pre-built models.

I don't know anything about gdb or how to use it. I've been hitting Google trying to figure out how to use it in this application, but haven't figured it out yet. That may have to wait until after I get an API working so that the project is functional for testing.

Without having a straightforward list of requirements, it's hard to know what will and won't work without trying it. The Makefile seems to indicate it will work down to 1.3, but it's obvious that 1.3 and 2.0 are not working with the nnet2 scripts.

As a startup without outside funding, with me devoting all my time to it and my other business (that is, working without a paycheck), money is extremely tight. I'd love to just buy new stuff and not worry about it. I chose hardware that is certainly powerful enough to handle the task at hand, but it's apparent that kaldi doesn't fully support it. What I don't get is that kaldi has been around for more than a couple of years; weren't these GPUs supported when much of kaldi was initially written? Or is nnet2 very new? Has it only been tested on the latest-generation NVidia GPUs?

Again, I don't buy into older hardware simply "being obsolete" because it's a few years old (the C2090 being from 2011) if it has the power necessary for the task at hand. That said, I don't know what is involved in writing code that *supports* older GPUs with lower compute capability, and if that would have been a lot of work, I DO understand why older GPUs were not supported. I've just never supported the concept of replacing everything every other year.

Thanks,

Rhiannon

RPC Portland

Feb 11, 2016, 5:06:19 PM
to kaldi-help
One other thing: are the newest GPUs like the GTX970 the only ones you know for sure work with the nnet2 and nnet3 scripts, or have you tested them with older (i.e. cheaper) GPUs? In other words, what is the lowest compute capability you know works with them?

Thanks

Rhiannon

Daniel Povey

Feb 11, 2016, 5:18:15 PM
to kaldi-help

We've only supported neural net training in Kaldi for about 3 years. We've got it to work with the GPUs we had access to, which doesn't include the older ones that you are using. If you can answer our questions about what tests were failing with what problem sizes, then we might be able to fix it. That will require you to learn how to use gdb, which is a very basic skill... it will involve typing something like
gdb matrix-lib-test
(gdb) catch throw
(gdb) r

and then when it fails, keep typing "up"  until you reach the stack frame where you can see variables that indicate the problem size, and use the "print" command to view them, e.g.
(gdb) print num_rows
or
(gdb) print x
or whatever the variable name happens to be.


We don't know exactly which GPUs will work with Kaldi because we haven't done extensive tests with different GPUs. 

If all you want to do is decode with a pre-built model, there are some instructions at http://kaldi-asr.org/doc/online_decoding.html which should work.


Dan




 
