Optimizing nnet3/chain models for speed and memory consumption


Guenter Bartsch

Nov 5, 2017, 7:55:20 AM
to kaldi...@googlegroups.com
Dear all,

I am wondering if anybody could give me some hints on which
(hyper-)parameters, settings, and model types I should try first when
tuning models for decoding speed and memory consumption versus
accuracy.

I have already figured out that pruning the language model helps a lot
with model size and therefore memory consumption ;) but other than
that I am a bit at a loss as to which options to try next. Here is
what I am looking at right now (decoding the same wave file with both
models, with the model already loaded in RAM):

nnet3 tdnn model: 1.48s  %WER 3.72
chain tdnn model: 1.70s  %WER 6.39

This was the first time I tried to build a chain model - and my
assumption that chain models are faster was obviously wrong, at least
for my setup. I consider myself an absolute beginner here, but I am
willing to learn - so pointers to documentation and source code to
read, as well as options to investigate, would be very welcome at this
point.
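
(To give an idea of the kind of LM pruning I mean - just a sketch,
assuming SRILM; the tool and the threshold are whatever fits your setup:)

# prune low-probability n-grams from the ARPA LM, then rebuild HCLG.fst from it
ngram -lm lm.arpa -prune 1e-7 -write-lm lm_pruned.arpa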

Background, in case you are wondering what I am doing here: I am
working on scripts to build English and German models from voxforge
and librispeech. Those scripts can be found on my github:
https://github.com/gooofy/speech. For the experiments I use my Python
kaldi asr wrapper: https://pypi.python.org/pypi/py-kaldi-asr.

Thanks,

Guenter

Daniel Povey

Nov 5, 2017, 11:20:36 AM
to kaldi-help
I suspect that you got the command-line parameters wrong when decoding the chain model.  Make sure to use frame-subsampling-factor=3 and acoustic-scale=1.0.  Look at the options used for decoding in the example scripts for inspiration; there may be more.  And if you are decoding just one file, bear in mind that loading the models and graphs incurs a startup cost.  (However, IIRC the program prints the real-time factor excluding that startup cost.)
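
For example, a chain decode command might look something like this (a
sketch only - the paths are placeholders, and your wrapper may set these
options differently):

online2-wav-nnet3-latgen-faster \
  --frame-subsampling-factor=3 --acoustic-scale=1.0 \
  --config=conf/online.conf --word-symbol-table=graph/words.txt \
  final.mdl graph/HCLG.fst ark:spk2utt scp:wav.scp ark:/dev/null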

Dan




Guenter Bartsch

Nov 5, 2017, 5:56:51 PM
to kaldi...@googlegroups.com
wow, thanks for your quick reply, Dan! :)

frame-subsampling-factor=3 was indeed what I was missing, and adding
that argument has improved decoding times considerably, so the chain
model is now faster than the nnet3 tdnn one:

chain: model load: 2.23s decoding: 1.17s
nnet3: model load: 6.29s decoding: 1.48s

I have included the times for model loading to show that I am actually
measuring the decoding time only (I am using my own Python scripts
here - that's also why the frame-subsampling-factor argument was
missing).

Are there any other decoder options I could try?

acoustic_scale=1.0, beam=7.0, frame_subsampling_factor=3.0

is what I am using at the moment - I see lattice-beam and max-active
in some of the scripts; could those help?

Besides that, I am still wondering which (hyper-)parameters impact the
speed/accuracy/memory trade-off when training chain models in
particular. My intention is to try a smaller relu_dim next (the
current model uses relu_dim=725), but maybe other options are worth
investigating as well?

I also noted that some examples have chain/tuning directories; I plan
to investigate (i.e. diff) those as well, but wonder which ones are
the most current / worth investigating first?

Thanks again,

Guenter

PS: in case you are wondering why I am still investigating speedups: I
would like to see if kaldi can be scaled down to embedded systems like
an rpi3 - here I am currently looking at a 24.7s decode time for my
test wave file, which is 4.7s long.

Daniel Povey

Nov 5, 2017, 6:45:37 PM
to kaldi-help
You could make the beam very small (like 3.0) or the max-active very small (like 200), and see if it affects the speed at all - the WER will be terrible, but it will give you an upper bound on the speed improvements possible from changing the decoder settings.
After that, the main opportunities to improve the speed would be to reduce the dimension of the hidden layers and to reduce the number of leaves in the tree (see build_tree.sh in the training example script).  Or remove layers.  But try the others first.
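
For instance (values purely illustrative - adapt the paths to your setup):

# speed-floor experiment: tiny beam / max-active; the WER will be terrible
online2-wav-nnet3-latgen-faster --beam=3.0 --max-active=200 \
  --frame-subsampling-factor=3 --acoustic-scale=1.0 \
  final.mdl graph/HCLG.fst ark:spk2utt scp:wav.scp ark:/dev/null

# fewer leaves: lower the num-leaves argument to build_tree.sh (here 3500)
steps/nnet3/chain/build_tree.sh --frame-subsampling-factor 3 \
  3500 data/train data/lang exp/tri_ali exp/chain/tree

# smaller hidden layers: reduce dim= in the network config (xconfig), e.g.
#   relu-renorm-layer name=tdnn1 dim=450   ->   dim=256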

Dan



Guenter Bartsch

Nov 5, 2017, 8:45:40 PM
to kaldi...@googlegroups.com
Dan, once again thanks for your very quick and helpful reply! :)

On Mon, Nov 6, 2017 at 12:45 AM, Daniel Povey <dpo...@gmail.com> wrote:
> You could make the beam very small (like 3.0) or the max-active very small
> (like 200), and see if it affects the speed at all- the WER will be terrible
> but it will give you an upper bound on the speed improvements possible from
> changing the decoder settings.

Tried that, and it doesn't seem to give any additional speedup over
the beam 7 / max_active 1000 I was using before.

However, I did find another way to speed things up: until now, I re-allocated

OnlineIvectorExtractorAdaptationState
OnlineSilenceWeighting
nnet3::DecodableNnetSimpleLoopedInfo
OnlineNnet2FeaturePipeline
SingleUtteranceNnet3Decoder

objects for each utterance - now, with some simple log-output
investigation, I noted that instantiating
nnet3::DecodableNnetSimpleLoopedInfo seems to be a pretty costly
operation. I changed my code so it only instantiates the
OnlineNnet2FeaturePipeline and SingleUtteranceNnet3Decoder objects
per utterance while keeping the other ones around (is there any
downside to this?), and now I am looking at

13.35s

decode time for my 4.5s utterance on the rpi3 :) (0.57s on my intel i5
machine...)

> After that, the main opportunities to improve the speed would be to to
> reduce the dimension of the hidden layers and to reduce the number of leaves
> in the tree (see build_tree.sh in the training example script). Or remove
> layers. But try the others first.

OK, thanks! Will try that next (it will take some time until I have
new models). Also, I investigated more scripts from the example folder
and noted that I had unfortunately based my setup on an outdated
script (one I found in the librispeech example). I am planning to
adapt my script to

kaldi/egs/tedlium/s5_r2/local/chain/tuning/run_tdnn_1d.sh

now - I hope that one is reasonably recent?

Any recommendations on other model types to try (chain-lstm,
chain-tdnn-lstm, nnet3-lstm, nnet3-tdnn, nnet3-tdnn-lstm) which might
offer a better speed/accuracy/memory trade-off?

sorry for all those questions - this is just such an exciting project! :)

Guenter

Daniel Povey

Nov 5, 2017, 8:59:09 PM
to kaldi-help
Look at whatever the soft link local/chain/run_tdnn.sh points to, to see what the latest and recommended model is -- usually that will be the last-numbered one.
If speed is a concern, best not to try any of the other topologies -- the TDNN will be the fastest.
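
E.g. (the target shown is just an example - it depends on the recipe
and Kaldi version):

cd egs/tedlium/s5_r2
readlink local/chain/run_tdnn.sh   # -> tuning/run_tdnn_1d.sh, for instance
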
Dan




Guenter Bartsch

Nov 6, 2017, 7:40:56 PM
to kaldi...@googlegroups.com
Dan,

On Mon, Nov 6, 2017 at 2:59 AM, Daniel Povey <dpo...@gmail.com> wrote:
> Look at whatever the soft link local/chain/run_tdnn.sh points to to see what
> is the latest and recommended model-- usually that will be the last-numbered
> one.

aaah - there is a symlink pointing to those! very cool, hadn't noticed
that - thanks! :)

> If speed is a concern, best not to try any of the other topologies-- the
> TDNN will be the fastest.

excellent - thanks again for your very quick and to the point answer,
highly appreciated! :)

keep up the good work,

guenter

Guenter Bartsch

Nov 13, 2017, 3:17:45 PM
to kaldi...@googlegroups.com
Just a quick follow-up: reducing the layer size from 450 to 250 let me
achieve (near) real-time performance for my chain models running on a
raspberry pi 3:

[bofh@donald py-kaldi-asr]$ python examples/chain_incremental.py
tdnn_250 loading model...
tdnn_250 loading model... done, took 7.084181s.
tdnn_250 creating decoder...
tdnn_250 creating decoder... done, took 14.327128s.
decoding data/gsp1.wav...
0.041s: 4000 frames ( 0.250s) decoded.
0.319s: 8000 frames ( 0.500s) decoded.
0.643s: 12000 frames ( 0.750s) decoded.
0.864s: 16000 frames ( 1.000s) decoded.
1.086s: 20000 frames ( 1.250s) decoded.
1.312s: 24000 frames ( 1.500s) decoded.
1.530s: 28000 frames ( 1.750s) decoded.
1.760s: 32000 frames ( 2.000s) decoded.
2.133s: 36000 frames ( 2.250s) decoded.
2.387s: 40000 frames ( 2.500s) decoded.
2.624s: 44000 frames ( 2.750s) decoded.
2.840s: 48000 frames ( 3.000s) decoded.
3.080s: 52000 frames ( 3.250s) decoded.
3.449s: 56000 frames ( 3.500s) decoded.
3.682s: 60000 frames ( 3.750s) decoded.
3.939s: 64000 frames ( 4.000s) decoded.
4.165s: 68000 frames ( 4.250s) decoded.
4.375s: 72000 frames ( 4.500s) decoded.
4.952s: 75200 frames ( 4.700s) decoded.

*****************************************************************
** data/gsp1.wav
** berlin gilt als weltstadt der kultur politik medien und wissenschaften
** tdnn_250 likelihood: 1.71563148499
*****************************************************************

tdnn_250 decoding took 4.96s
[bofh@donald py-kaldi-asr]$ uname -a
Linux donald 4.9.40-v7.1.el7 #1 SMP Tue Aug 8 14:03:02 UTC 2017 armv7l
armv7l armv7l GNU/Linux

WER still looks very good:

%WER 1.18 [ 1174 / 99422, 188 ins, 373 del, 613 sub ]
exp/nnet3_chain/tdnn_sp/decode_test/wer_9_0.0
%WER 1.57 [ 1563 / 99422, 250 ins, 446 del, 867 sub ]
exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0

If anybody is interested in these models, they are available for download here:

http://goofy.zamia.org/voxforge/de/kaldi-chain-voxforge-de-r20171113.tar.xz

The scripts used are available on my github; the Kaldi script in
particular is located here:

https://github.com/gooofy/speech/blob/master/data/src/speech/kaldi-run-chain.sh

thanks again for your support and keep up the good work! :)

guenter

Daniel Povey

Nov 13, 2017, 3:20:49 PM
to kaldi-help
Cool!




sabr

Nov 24, 2017, 3:30:37 AM
to kaldi-help
Hello,

Guenter Bartsch, could you please answer these questions? I am considering your scripts for a program that should run ASR on request, i.e. keeping the models loaded is a must for this program's daemon:

1) For nnet3 you had these results - what was the duration of the WAV file? For example, how long does your model take on a 1-hour WAV?

chain: model load: 2.23s decoding: 1.17s 
nnet3: model load: 6.29s decoding: 1.48s 

2) Minor question: what is your GPU card model for these results above? I have 1x GeForce 1080Ti and hope it will be enough.

3) For long (> 1 hour) WAVs, what is the CPU and RAM utilization? This is something I asked about months ago; now I am willing to change strategy, because a 1-hour WAV in Kaldi ASR consumes ~4 GB of RAM.

Guenter Bartsch

Nov 27, 2017, 5:08:19 PM
to kaldi...@googlegroups.com
Hi,

On Fri, Nov 24, 2017 at 9:30 AM, sabr <sabrtas...@gmail.com> wrote:
> 1) For nnet3 you had results these and what was the duration of WAV file?

My test wave file is 4.7s long.

> For example, how long does your model deal with 1 hour WAV?

I have no experience with such long wave files - I am mostly dealing
with single utterances, one at a time, in a voice-assistant-like
scenario.

> 2) Minor question: what is your GPU card model for these results above? I
> have 1x GeForce 1080Ti and hope it will be enough.

For decoding I am not using a GPU at all; for training I am using a
dual-Xeon machine with 56 GB of RAM and a GTX 980 GPU.

> 3) For long > 1 hour WAVs, what is the CPU and RAM utilization? This is
> something that I asked months ago, now I'm willing to change the strategy
> because 1 hour WAV of Kaldi ASR consumes ~4Gb of RAM

Sorry, I have no experience with long wave files. For my short ones I
can report that even an rpi3 with 1 GB of RAM is sufficient.

guenter

Daniel Povey

Nov 27, 2017, 5:13:57 PM
to kaldi-help
For long WAV files you should probably break them up into smaller pieces (certainly no longer than a minute each); you can do this by creating a 'segments' file.
If you have your data formatted as a data directory, you can uniformly segment it using a sequence of commands similar to stage 3 of the script steps/segmentation/prepare_targets_gmm.sh.
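
The 'segments' file format is one segment per line, with times in seconds:

# <utterance-id> <recording-id> <segment-begin> <segment-end>
rec1-seg1 rec1 0.0 30.0
rec1-seg2 rec1 30.0 60.0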





Ho Yin Chan

Nov 28, 2017, 11:57:13 AM
to kaldi-help
BTW, why are the acoustic scores for chain models much less "strong" than those of models trained with the cross-entropy cost function / GMM models? Any intuition behind this?

Ricky 


Daniel Povey

Nov 28, 2017, 12:02:12 PM
to kaldi-help
Because they are trained with a sequence-level objective (the log-prob of the correct word sequence), and to maximize that, the scores need to be less strong so that the posteriors are well calibrated.  (You'd need to scale down the scores from a cross-entropy system with a 10ms frame shift by a factor of 10 or so to get well-calibrated posteriors.)
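
Roughly speaking (my notation, just to illustrate): decoding maximizes

  \hat{W} = \arg\max_W \big( \alpha \log p_{\mathrm{AM}}(O \mid W) + \log p_{\mathrm{LM}}(W) \big)

and chain models are trained so that the acoustic scale \alpha = 1.0 gives well-calibrated combined scores, whereas a cross-entropy system needs \alpha \approx 0.1 (equivalently, an LM scale around 10).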



Ho Yin Chan

Nov 28, 2017, 11:19:39 PM
to kaldi-help
Sorry for going a bit off-topic from the thread title.

It seems that when we train from scratch using GMMs, the EM procedure (either with full forward/backward or with Viterbi state alignment) for the ML models already produces acoustic scores in a different regime of the high-dimensional search space for optimal word-sequence paths (compared to the current chain model): the GMMs produce acoustic scores which require a higher LM scale. And since MMI training for GMMs is done with a weak LM for lattice generation, the acoustic scores just stay at a similar scale under the MMI objective function. I am not sure whether a single-state model (instead of the traditional, e.g., 3-state HMM with transition probabilities that can be optimized during EM) and the sub-sampling at the output of the neural network (instead of 1 frame in, 1 frame out, as in frame-based cross-entropy training) would affect the acoustic scale in the search for the optimal word sequence.

Ricky


Daniel Povey

Nov 28, 2017, 11:21:24 PM
to kaldi-help
In LF-MMI, AKA chain models, the system is trained from scratch without any acoustic scale, which ensures that the posteriors are well calibrated, so no (or very little) acoustic scaling is required.



Ho Yin Chan

Nov 29, 2017, 11:05:32 AM
to kaldi-help
That's reasonable.  Thanks.


Guenter Bartsch

Nov 29, 2017, 6:58:44 PM
to kaldi...@googlegroups.com
Another follow-up on my chain-models-for-embedded-platforms
experiment: my 1000+ hour English models (trained on
librispeech+voxforge data) have just finished training (using the same
scripts as the German models), and I am pretty happy with the results:

%WER 2.48 [ 12525 / 504653, 737 ins, 2720 del, 9068 sub ]
exp/nnet3_chain/tdnn_sp/decode_test/wer_10_0.0
%WER 3.03 [ 15269 / 504653, 948 ins, 3260 del, 11061 sub ]
exp/nnet3_chain/tdnn_250/decode_test/wer_9_0.0

Plus: the smaller (250) model still achieves near-real-time
performance on a raspberry pi 3:

[bofh@donald py-kaldi-asr]$ python examples/chain_incremental.py
tdnn_250 loading model...
tdnn_250 loading model... done, took 23.394126s.
tdnn_250 creating decoder...
tdnn_250 creating decoder... done, took 14.411979s.
decoding data/dw961.wav...
0.087s: 4000 frames ( 0.250s) decoded.
0.400s: 8000 frames ( 0.500s) decoded.
0.742s: 12000 frames ( 0.750s) decoded.
1.021s: 16000 frames ( 1.000s) decoded.
1.263s: 20000 frames ( 1.250s) decoded.
1.497s: 24000 frames ( 1.500s) decoded.
1.714s: 28000 frames ( 1.750s) decoded.
1.992s: 32000 frames ( 2.000s) decoded.
2.370s: 36000 frames ( 2.250s) decoded.
2.642s: 40000 frames ( 2.500s) decoded.
2.873s: 44000 frames ( 2.750s) decoded.
3.112s: 48000 frames ( 3.000s) decoded.
3.333s: 52000 frames ( 3.250s) decoded.
3.668s: 56000 frames ( 3.500s) decoded.
3.876s: 60000 frames ( 3.750s) decoded.
4.092s: 64000 frames ( 4.000s) decoded.
4.305s: 68000 frames ( 4.250s) decoded.
4.517s: 72000 frames ( 4.500s) decoded.
4.951s: 74000 frames ( 4.625s) decoded.

*****************************************************************
** data/dw961.wav
** i cannot follow you she said
** tdnn_250 likelihood: 1.99656772614
*****************************************************************

tdnn_250 decoding took 4.95s

In case anybody is interested in trying these models, they are
available for download here:

http://goofy.zamia.org/voxforge/en/kaldi-chain-voxforge-en-r20171129.tar.xz

and, as always, all my scripts are open source, freely available on github:

https://github.com/gooofy/speech

guenter

Daniel Povey

Nov 29, 2017, 7:01:37 PM
to kaldi-help
Thanks a lot!




