How to train a binary neural network for speech recognition


fei

Aug 1, 2018, 8:39:16 AM
to kaldi-help
Hi,
    In order to reduce the model size, I want to train a binary neural network. Has anyone tried this? Or can you give me some advice?

Daniel Povey

Aug 1, 2018, 3:51:52 PM
to kaldi-help
You can always compress the model after it's trained-- a few bits should be enough.
I have a hard time thinking of a scenario where this would matter though.  Normally in situations where on-device storage is at a premium, speed is at a premium too so you'd need a model that's fast to evaluate; I doubt that binary weights would really help there.
Dan
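
To make that suggestion concrete, here is a minimal NumPy sketch of uniform post-training weight quantization; the bit width and layer shape are made up for illustration, and this is not Kaldi code (you would have to export and re-import the weights yourself, e.g. via the nnet3 tools in text mode).

  # Sketch of uniform k-bit post-training quantization (NumPy only, not Kaldi code).
  import numpy as np

  def quantize(w, num_bits=4):
      """Map a float weight matrix to small integer codes plus a scale and offset."""
      levels = 2 ** num_bits - 1
      w_min, w_max = float(w.min()), float(w.max())
      scale = (w_max - w_min) / levels if w_max > w_min else 1.0
      codes = np.round((w - w_min) / scale).astype(np.uint8)  # would be bit-packed in practice
      return codes, scale, w_min

  def dequantize(codes, scale, w_min):
      """Reconstruct approximate float weights for evaluation."""
      return codes.astype(np.float32) * scale + w_min

  w = np.random.randn(768, 220).astype(np.float32)            # a made-up affine layer
  codes, scale, offset = quantize(w, num_bits=4)
  w_hat = dequantize(codes, scale, offset)
  print("max abs quantization error:", np.abs(w - w_hat).max())

At evaluation time the weights still have to be dequantized (or the matrix multiplies rewritten to work on the codes), which is why this saves storage but not necessarily compute.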



Arkadi Gurevich

Aug 19, 2018, 1:46:24 AM
to kaldi-help
Hi Daniel,

According to the "Binarized Neural Networks" paper, this technique should reduce memory usage during the forward pass.

Don't you think it might work with Kaldi's nnet?

All the best,
Arkadi
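
For reference, the core trick in that paper amounts to something like the following NumPy sketch; the layer shape is made up, the sketch is framework-agnostic, and (as Dan notes below) nothing like this is implemented in Kaldi's nnet3.

  # Deterministic weight binarization in the style of the Binarized Neural Networks
  # paper: real-valued weights are kept for the parameter updates, and only their
  # signs are used in the forward pass.
  import numpy as np

  def binarize(w):
      """Deterministic binarization: every weight becomes +1 or -1."""
      return np.where(w >= 0, 1.0, -1.0).astype(np.float32)

  w_real = np.random.randn(768, 220).astype(np.float32)  # real-valued "shadow" weights
  w_bin = binarize(w_real)                                # what the forward pass sees
  # Training backpropagates through w_bin but applies the update to w_real
  # (a straight-through estimator); at inference only 1 bit per weight needs
  # to be stored.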


Daniel Povey

Aug 19, 2018, 1:52:21 PM
to kaldi-help
Sure, it might work, but it would require some work. And it depends on what your target low-power device is and whether it can actually make use of that (i.e. efficiently in time). My feeling is that in many such cases it would make sense to just train a smaller model in the first place, because a binarized or compressed model would often be too slow to evaluate. Certainly storing each parameter in a byte might make sense for some devices, but it's not implemented yet. Device support gets very complicated and time-consuming, and I think Kaldi can have more impact by concentrating on the core stuff.

Arkadi Gurevich

Aug 19, 2018, 3:31:22 PM
to kaldi-help
Thanks Daniel,

I'd be happy to hear your advice about the project I'm working on.
I want to build a phoneme-recognition CNN (probably a TIMIT-style network)
while keeping the memory footprint small (up to 10 MB).
I am aware that the WER will increase.

I have read a number of articles on different techniques for reducing a model's memory consumption,
for example binarized or compressed models, weight quantization, converting weights from floating point to fixed point, pruning weights, etc.

The steps I planned were:

1. Train an nnet based on mini-librispeech (or another small model).
2. Use one of these techniques to reduce the size of the model.
3. Build the CNN with the weights obtained in training.


Do you think it is possible to reach these sizes with such a neural network?
Does this process sound logical to you?
How would you recommend reaching the goal (a phoneme-recognition CNN with memory consumption of up to 10 MB)?

Arkadi

Daniel Povey

Aug 19, 2018, 3:40:01 PM
to kaldi-help
Our models for mini_librispeech right now are 20M on disk, so even by reducing the floats to 16 bits you'd only get those models down to 10M. I suggest working on other aspects of the project first, though, i.e. see if you can accomplish the task with a bigger model before worrying about compressing it.

Dan
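
As a quick sanity check of that arithmetic (the parameter count below is hypothetical, chosen to give roughly 20 MB in 32-bit floats; whether a given toolkit or device can read 16-bit weights back is a separate question, since Kaldi stores its parameters as 32-bit floats by default):

  import numpy as np

  num_params = 5_000_000                     # hypothetical count: ~20 MB as float32
  w32 = np.zeros(num_params, dtype=np.float32)
  w16 = w32.astype(np.float16)
  print(w32.nbytes / 1e6, "MB as float32")   # 20.0
  print(w16.nbytes / 1e6, "MB as float16")   # 10.0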

Arkadi Gurevich

Aug 20, 2018, 10:39:49 AM
to kaldi-help
Hi Daniel,

I want to train a TDNN model based on the Librispeech dataset.
I went through the configuration of the network and tried to understand its topology.

I attach the topology from "run_tdnn_1h.sh" and I have a few questions about the lines below:

  input dim=100 name=ivector
  input dim=40 name=input
  fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

  relu-batchnorm-dropout-layer name=tdnn1 $tdnn_opts dim=768
  tdnnf-layer name=tdnnf2 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
  tdnnf-layer name=tdnnf3 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
  tdnnf-layer name=tdnnf4 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
  tdnnf-layer name=tdnnf5 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=0
  tdnnf-layer name=tdnnf6 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf7 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf8 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf9 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf10 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf11 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf12 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf13 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  linear-component name=prefinal-l dim=192 $linear_opts

  prefinal-layer name=prefinal-chain input=prefinal-l $prefinal_opts small-dim=192 big-dim=768
  output-layer name=output include-log-softmax=false dim=$num_targets $output_opts

  prefinal-layer name=prefinal-xent input=prefinal-l $prefinal_opts small-dim=192 big-dim=768
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor $output_opts


1. What is the dimension of the input layer (fixed-affine)? Is it 100+40?

2. I guess tdnn_i+1 gets as input a window of context from tdnn_i. If so, how do I know the size of that context window? Is it related to "time-stride"?

3. The two lines related to the output of the net: do they define one or two layers?


Thanks for your time,

Daniel Povey

Aug 20, 2018, 2:04:31 PM
to kaldi-help
I am hoping someone else can respond to this.

Shin XXX

Aug 20, 2018, 10:03:55 PM
to kaldi...@googlegroups.com
The "final.config" file contains more details, and the command that generates it is also in "run_tdnn_1h.sh".

1. What is the dimension of the input layer (fixed-affine)? Is it 100+40?
It's 220: input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) means this layer concatenates 3 spliced input frames and 1 ivector, resulting in a (3 * 40 + 100 =) 220-dimensional feature.
2. I guess tdnn_i+1 gets as input a window of context from tdnn_i. If so, how do I know the size of that context window? Is it related to "time-stride"?
Yes, it's related to "time-stride". A TDNN layer is somewhat like a 1-D convolutional layer, and "time-stride" can be viewed as the "stride" in a CNN, so if time-stride is 3, the layer takes the (-3, 0, 3) outputs of its previous layer as input. Read "final.config"; it tells you the input dim of each layer. But if you want to know how many acoustic frames each layer covers, I am not sure... I usually just do the math and calculate it myself.
3. The two lines related to the output of the net: do they define one or two layers?
Two.
Shin
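
For illustration, here is a back-of-the-envelope version of that hand calculation for the topology quoted above, using the simplification that a layer with time-stride s looks s frames to each side; the exact context of the real model may differ slightly because of how the factored (tdnnf) layers split their offsets.

  # Approximate per-side context of the quoted run_tdnn_1h.sh topology.
  # The first 1 comes from the Append(-1,0,1) splicing at the lda layer;
  # the rest are the time-strides of tdnnf2..tdnnf13.
  strides = [1, 1, 1, 1, 0, 3, 3, 3, 3, 3, 3, 3, 3]
  print("approx. frames of context on each side:", sum(strides))   # 28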


Arkadi Gurevich

Aug 21, 2018, 2:42:07 AM
to kaldi-help
Thanks for your help Shin,

I am trying to calculate/estimate the size of the resulting model after training.
My equation is: (# of neurons * neuron size in memory) + (# of arcs/weights * sizeof(float/double)).

I am not sure which part of the code represents a neuron; is it a float, a double, or a more complex variable?

As I understand it, the network works with matrices, so there is no explicit representation of arcs. If so, how do I find out the connection topology between the layers?



Arkadi

Shin XXX

Aug 21, 2018, 3:32:57 AM
to kaldi...@googlegroups.com
I am trying to calculate/estimate the size of the resulting model after training.
It's simple: run nnet3-info, it tells you how many parameters a model has.
whether it's a float, a double, or a more complex variable?
I think it should be "BaseFloat", which is "float" on my system.
if so, how do I find out the connection topology between the layers?
It's all in "final.config": you'll notice something like "input=Append(1, layer-name)"; it represents a link between the current layer and the layer named "layer-name".

Shin


Arkadi Gurevich

Aug 21, 2018, 4:22:44 AM
to kaldi-help
Thank you for the quick reply, Shin;
however, I probably did not explain myself properly, sorry for that.

1. I want to estimate the size of the model that will be created after training, without actually training it (for technical reasons).
Do you have any suggestions for how to do this?

2. Let's look at the following line:
 "relu-renorm-layer name=tdnn2 dim=512 input=Append(-1,0,1)", and suppose that the previous layer was
 "relu-renorm-layer name=tdnn1 dim=512".

Could you please explain, using this example, how to interpret the number of arcs between the two layers (and perhaps also give a brief explanation of Append; I have not found documentation on it)?

3. As I read the "A time delay neural network architecture for efficient modeling of long temporal contexts" paper, each layer should have a smaller number of neurons than the previous one. In this network (mini-librispeech tdnn 1h) all the layers have the same dim. What am I not understanding?

I greatly appreciate your help Shin,
Arkadi

Shin XXX

Aug 21, 2018, 12:14:50 PM
to kaldi...@googlegroups.com
1. I want to estimate the size of the model that will be created after training, without actually training it (for technical reasons).
I usually just do some "fake training", e.g. generate some random data to "train" the model for only one iteration, and then run nnet3-info...
But if you're just curious about the number of parameters and don't want to train anything, you need to figure out the number of parameters yourself.
Could you please explain, using this example, how to interpret the number of arcs between the two layers?
As for "number of arcs" and "neurons": I assume the "number of arcs" is the size of the transform matrix between the two layers; with the Append(-1,0,1) splicing in your example that's (3 * 512) * 512. (And the "neurons" would correspond to the float activations in each layer?)
Besides, take a line from "final.config", for example:

component name=tdnnf2.affine type=TdnnComponent input-dim=160 output-dim=1536 l2-regularize=0.008 max-change=0.75 time-offsets=0,1

there are 160 * 1536 arcs between its previous layer and tdnnf2.
a brief explanation of Append
Suppose the MFCC dim is 40 and the first layer says its input is Append(-1, 0, 1); that means that at time t you concatenate the MFCCs at times (t-1), t, and (t+1) as the model input, resulting in a 3*40=120-dimensional input feature.
And I notice you're quoting some lines from the xconfig... I'm starting to wonder whether final.config is harder to understand, but I do think it provides more details than the xconfig. It took me some time to understand those config files, and it was worth it.
each layer should have a smaller number of neurons than the previous layer.
I read that paper long ago and have kind of forgotten where it mentions this; maybe you could give me more details. I guess you're talking about Fig. 1, since the number of boxes per layer gets smaller as you move up the layers? Those little rectangular boxes are not the "neurons" I mentioned before; they are matrices, and yes, they can have the same dim; it's like all the small boxes having the same width.
 
Shin
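
To illustrate that kind of hand calculation, here is a small sketch that estimates the parameter count (and the float32 size on disk) from a few (input-dim, output-dim) pairs. The dims below are examples in the style of the config lines quoted in this thread, not a transcription of a real final.config; for a trained model, the nnet3-info tool Shin mentions reports the exact number of parameters.

  # Rough model-size estimate from per-component dimensions (illustrative numbers).
  layer_dims = [
      (220, 768),    # e.g. an affine from the 220-dim spliced input to a 768-dim layer
      (160, 1536),   # e.g. the tdnnf2.affine TdnnComponent quoted above
      (768, 192),    # e.g. a linear bottleneck like prefinal-l
  ]

  params = sum(in_dim * out_dim + out_dim        # weight matrix plus bias vector
               for in_dim, out_dim in layer_dims)
  print("parameters:", params)
  print("approx. size as float32: %.2f MB" % (params * 4 / 1e6))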

