How to train a binary neural network for speech recognition


fei

Aug 1, 2018, 8:39:16 AM
to kaldi-help
Hi,
    In order to reduce the model size, I want to train a binary neural network. Has anyone tried this? Or can you give me some advice?

Daniel Povey

Aug 1, 2018, 3:51:52 PM
to kaldi-help
You can always compress the model after it's trained-- a few bits should be enough.
I have a hard time thinking of a scenario where this would matter though.  Normally in situations where on-device storage is at a premium, speed is at a premium too so you'd need a model that's fast to evaluate; I doubt that binary weights would really help there.
Dan
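
To make that suggestion concrete, here is a minimal NumPy sketch of uniform post-training weight quantization; the bit width and layer shape are made up for illustration, and this is not Kaldi code (you would have to export and re-import the weights yourself, e.g. via the nnet3 tools in text mode).

  # Sketch of uniform k-bit post-training quantization (NumPy only, not Kaldi code).
  import numpy as np

  def quantize(w, num_bits=4):
      """Map a float weight matrix to small integer codes plus a scale and offset."""
      levels = 2 ** num_bits - 1
      w_min, w_max = float(w.min()), float(w.max())
      scale = (w_max - w_min) / levels if w_max > w_min else 1.0
      codes = np.round((w - w_min) / scale).astype(np.uint8)  # would be bit-packed in practice
      return codes, scale, w_min

  def dequantize(codes, scale, w_min):
      """Reconstruct approximate float weights for evaluation."""
      return codes.astype(np.float32) * scale + w_min

  w = np.random.randn(768, 220).astype(np.float32)            # a made-up affine layer
  codes, scale, offset = quantize(w, num_bits=4)
  w_hat = dequantize(codes, scale, offset)
  print("max abs quantization error:", np.abs(w - w_hat).max())

At evaluation time the weights still have to be dequantized (or the matrix multiplies rewritten to work on the codes), which is why this saves storage but not necessarily compute.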



Arkadi Gurevich

Aug 19, 2018, 1:46:24 AM
to kaldi-help
Hi Daniel,

According to the "Binarized Neural Networks" paper, this technique should reduce memory usage during the forward pass.

Don't you think it might work with Kaldi's nnet?

All the best,
Arkadi
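
For reference, the core trick in that paper amounts to something like the following NumPy sketch; the layer shape is made up, the sketch is framework-agnostic, and (as Dan notes below) nothing like this is implemented in Kaldi's nnet3.

  # Deterministic weight binarization in the style of the Binarized Neural Networks
  # paper: real-valued weights are kept for the parameter updates, and only their
  # signs are used in the forward pass.
  import numpy as np

  def binarize(w):
      """Deterministic binarization: every weight becomes +1 or -1."""
      return np.where(w >= 0, 1.0, -1.0).astype(np.float32)

  w_real = np.random.randn(768, 220).astype(np.float32)  # real-valued "shadow" weights
  w_bin = binarize(w_real)                                # what the forward pass sees
  # Training backpropagates through w_bin but applies the update to w_real
  # (a straight-through estimator); at inference only 1 bit per weight needs
  # to be stored.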


Daniel Povey

Aug 19, 2018, 1:52:21 PM
to kaldi-help
Sure, it might work, but it would require some work. And it depends on what your target low-power device is and whether it can actually make use of that (i.e. efficiently in time). My feeling is that in many such cases it would make sense to just train a smaller model in the first place, because a binarized or compressed model would often be too slow to evaluate. Certainly storing each parameter in a byte might make sense for some devices, but it's not implemented yet. Device support gets very complicated and time-consuming, and I think Kaldi can have more impact by concentrating on the core stuff.

Arkadi Gurevich

Aug 19, 2018, 3:31:22 PM
to kaldi-help
Thanks Daniel,

I'd be happy to hear your advice about the project I'm working on.
I want to build a phoneme-recognition CNN (probably a TIMIT-style network)
while keeping the memory footprint small (up to 10 MB).
I am aware that the WER will increase.

I have read a number of articles on different techniques for reducing a model's memory consumption,
for example binarized or compressed models, weight quantization, converting weights from floating point to fixed point, pruning weights, etc.

The steps I planned were:

1. Train an nnet based on mini-librispeech (or another small model).
2. Use one of these techniques to reduce the size of the model.
3. Build the CNN with the weights obtained in training.


Do you think it is possible to reach these sizes with such a neural network?
Does this process sound logical to you?
How would you recommend reaching the goal (a phoneme-recognition CNN with memory consumption of up to 10 MB)?

Arkadi

Daniel Povey

Aug 19, 2018, 3:40:01 PM
to kaldi-help
Our models for mini_librispeech right now are 20M on disk, so even by reducing the floats to 16 bits you'd only get those models down to 10M. I suggest working on other aspects of the project first, though, i.e. see if you can accomplish the task with a bigger model before worrying about compressing it.

Dan
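
As a quick sanity check of that arithmetic (the parameter count below is hypothetical, chosen to give roughly 20 MB in 32-bit floats; whether a given toolkit or device can read 16-bit weights back is a separate question, since Kaldi stores its parameters as 32-bit floats by default):

  import numpy as np

  num_params = 5_000_000                     # hypothetical count: ~20 MB as float32
  w32 = np.zeros(num_params, dtype=np.float32)
  w16 = w32.astype(np.float16)
  print(w32.nbytes / 1e6, "MB as float32")   # 20.0
  print(w16.nbytes / 1e6, "MB as float16")   # 10.0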

Arkadi Gurevich

Aug 20, 2018, 10:39:49 AM
to kaldi-help
Hi Daniel,

I want to train a TDNN model based on the Librispeech dataset.
I went through the configuration of the network and tried to understand its topology.

I attach the topology from "run_tdnn_1h.sh" and I have a few questions about the lines below:

  input dim=100 name=ivector
  input dim=40 name=input
  fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

  relu-batchnorm-dropout-layer name=tdnn1 $tdnn_opts dim=768
  tdnnf-layer name=tdnnf2 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
  tdnnf-layer name=tdnnf3 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
  tdnnf-layer name=tdnnf4 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
  tdnnf-layer name=tdnnf5 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=0
  tdnnf-layer name=tdnnf6 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf7 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf8 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf9 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf10 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf11 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf12 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  tdnnf-layer name=tdnnf13 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
  linear-component name=prefinal-l dim=192 $linear_opts

  prefinal-layer name=prefinal-chain input=prefinal-l $prefinal_opts small-dim=192 big-dim=768
  output-layer name=output include-log-softmax=false dim=$num_targets $output_opts

  prefinal-layer name=prefinal-xent input=prefinal-l $prefinal_opts small-dim=192 big-dim=768
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor $output_opts


1. What is the dimension of the input layer (fixed-affine)? Is it 100+40?

2. I guess tdnn_i+1 gets as input a window of context from tdnn_i. If so, how do I know the size of that context window? Is it related to "time-stride"?

3. The two lines related to the output of the net: do they define one or two layers?


Thanks for your time,

Daniel Povey

Aug 20, 2018, 2:04:31 PM
to kaldi-help
I am hoping someone else can respond to this.

Shin XXX

Aug 20, 2018, 10:03:55 PM
to kaldi...@googlegroups.com
The "final.config" file contains more details, and the command that generates it is also in "run_tdnn_1h.sh".

1. What is the dimension of the input layer (fixed-affine)? Is it 100+40?
It's 220: input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) means this layer concatenates 3 spliced input frames and 1 ivector, resulting in a (3 * 40 + 100 =) 220-dimensional feature.
2. I guess tdnn_i+1 gets as input a window of context from tdnn_i. If so, how do I know the size of that context window? Is it related to "time-stride"?
Yes, it's related to "time-stride". A TDNN layer is somewhat like a 1-D convolutional layer, and "time-stride" can be viewed as the "stride" in a CNN, so if time-stride is 3, the layer takes the (-3, 0, 3) outputs of its previous layer as input. Read "final.config"; it tells you the input dim of each layer. But if you want to know how many acoustic frames each layer covers, I am not sure... I usually just do the math and calculate it myself.
3. The two lines related to the output of the net: do they define one or two layers?
Two.
Shin
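
For illustration, here is a back-of-the-envelope version of that hand calculation for the topology quoted above, using the simplification that a layer with time-stride s looks s frames to each side; the exact context of the real model may differ slightly because of how the factored (tdnnf) layers split their offsets.

  # Approximate per-side context of the quoted run_tdnn_1h.sh topology.
  # The first 1 comes from the Append(-1,0,1) splicing at the lda layer;
  # the rest are the time-strides of tdnnf2..tdnnf13.
  strides = [1, 1, 1, 1, 0, 3, 3, 3, 3, 3, 3, 3, 3]
  print("approx. frames of context on each side:", sum(strides))   # 28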


Arkadi Gurevich

Aug 21, 2018, 2:42:07 AM
to kaldi-help
Thanks for your help Shin,

I am trying to calculate/estimate the size of the resulting model after training.
My equation is: (# of neurons * neuron size in memory) + (# of arcs/weights * sizeof(float/double)).

I am not sure which part of the code represents a neuron; is it a float, a double, or a more complex variable?

As I understand it, the network works with matrices, so there is no explicit representation of arcs. If so, how do I find out the connection topology between the layers?



Arkadi

Shin XXX

Aug 21, 2018, 3:32:57 AM
to kaldi...@googlegroups.com
I am trying to calculate/estimate the size of the resulting model after training.
It's simple: run nnet3-info, it tells you how many parameters a model has.
whether it's a float, a double, or a more complex variable?
I think it should be "BaseFloat", which is "float" on my system.
if so, how do I find out the connection topology between the layers?
It's all in "final.config": you'll notice something like "input=Append(1, layer-name)"; it represents a link between the current layer and the layer named "layer-name".

Shin


Arkadi Gurevich

Aug 21, 2018, 4:22:44 AM
to kaldi-help
Thank you for the quick reply, Shin;
however, I probably did not explain myself properly, sorry for that.

1. I want to estimate the size of the model that will be created after training, without actually training it (for technical reasons).
Do you have any suggestions for how to do this?

2. Let's look at the following line:
 "relu-renorm-layer name=tdnn2 dim=512 input=Append(-1,0,1)", and suppose that the previous layer was
 "relu-renorm-layer name=tdnn1 dim=512".

Could you please explain, using this example, how to interpret the number of arcs between the two layers (and perhaps also give a brief explanation of Append; I have not found documentation on it)?

3. As I read the "A time delay neural network architecture for efficient modeling of long temporal contexts" paper, each layer should have a smaller number of neurons than the previous one. In this network (mini-librispeech tdnn 1h) all the layers have the same dim. What am I not understanding?

I greatly appreciate your help Shin,
Arkadi

Shin XXX

Aug 21, 2018, 12:14:50 PM
to kaldi...@googlegroups.com
1. I want to estimate the size of the model that will be created after training, without actually training it (for technical reasons).
I usually just do some "fake training", e.g. generate some random data to "train" the model for only one iteration, and then run nnet3-info...
But if you're just curious about the number of parameters and don't want to train anything, you need to figure out the number of parameters yourself.
Could you please explain, using this example, how to interpret the number of arcs between the two layers?
As for "number of arcs" and "neurons": I assume the "number of arcs" is the size of the transform matrix between the two layers; with the Append(-1,0,1) splicing in your example that's (3 * 512) * 512. (And the "neurons" would correspond to the float activations in each layer?)
Besides, take a line from "final.config", for example:

component name=tdnnf2.affine type=TdnnComponent input-dim=160 output-dim=1536 l2-regularize=0.008 max-change=0.75 time-offsets=0,1

there are 160 * 1536 arcs between its previous layer and tdnnf2.
a brief explanation of Append
Suppose the MFCC dim is 40 and the first layer says its input is Append(-1, 0, 1); that means that at time t you concatenate the MFCCs at times (t-1), t, and (t+1) as the model input, resulting in a 3*40=120-dimensional input feature.
And I notice you're quoting some lines from the xconfig... I'm starting to wonder whether final.config is harder to understand, but I do think it provides more details than the xconfig. It took me some time to understand those config files, and it was worth it.
each layer should have a smaller number of neurons than the previous layer.
I read that paper long ago and have kind of forgotten where it mentions this; maybe you could give me more details. I guess you're talking about Fig. 1, since the number of boxes per layer gets smaller as you move up the layers? Those little rectangular boxes are not the "neurons" I mentioned before; they are matrices, and yes, they can have the same dim; it's like all the small boxes having the same width.
 
Shin
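
To illustrate that kind of hand calculation, here is a small sketch that estimates the parameter count (and the float32 size on disk) from a few (input-dim, output-dim) pairs. The dims below are examples in the style of the config lines quoted in this thread, not a transcription of a real final.config; for a trained model, the nnet3-info tool Shin mentions reports the exact number of parameters.

  # Rough model-size estimate from per-component dimensions (illustrative numbers).
  layer_dims = [
      (220, 768),    # e.g. an affine from the 220-dim spliced input to a 768-dim layer
      (160, 1536),   # e.g. the tdnnf2.affine TdnnComponent quoted above
      (768, 192),    # e.g. a linear bottleneck like prefinal-l
  ]

  params = sum(in_dim * out_dim + out_dim        # weight matrix plus bias vector
               for in_dim, out_dim in layer_dims)
  print("parameters:", params)
  print("approx. size as float32: %.2f MB" % (params * 4 / 1e6))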

