What's the meaning of the content of nnet3's config?


Alice Aria

Sep 7, 2017, 9:32:40 AM
to kaldi-help
I've found a DNN example like this:

cat <<EOF > $dir/configs/network.xconfig
input dim=40 name=input
output name=output-tmo input=Append(-2,-1,0,1,2)
sigmoid-layer name=dnn1 input=Append(input@-4,input@-3,input@-2,input@-1,input@1,input@2,input@3,input@4) dim=2048
sigmoid-layer name=dnn2 dim=2048
sigmoid-layer name=dnn3 dim=2048
output-layer name=output dim=243
EOF

Well, what's the meaning of "input@-4"? Is there any documentation about this? And I guess sigmoid-layer is the "<descriptor>" mentioned on the homepage? Where can I see all of the available descriptors?

Daniel Povey

Sep 7, 2017, 2:19:41 PM
to kaldi-help
Hm. The syntax x@y is an abbreviation used in xconfig files: it stands for Offset(x, y). I'm afraid there isn't a documentation page for that, not yet at least. And a bare number y where a descriptor is expected, as in the numbers in the expression Append(-2, -1, 0, 1, 2), is equivalent to Offset(name-of-previous-xconfig-layer, y), e.g. Offset(input, -2) in this example if y is -2.
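
For instance (just a sketch, with made-up layer names), these two lines would mean exactly the same thing:

sigmoid-layer name=dnn2 input=Append(dnn1@-1, dnn1@0, dnn1@1) dim=2048
sigmoid-layer name=dnn2 input=Append(Offset(dnn1, -1), Offset(dnn1, 0), Offset(dnn1, 1)) dim=2048

and if dnn1 is the previous layer, input=Append(-1, 0, 1) would expand to the same thing again.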

There is some documentation about Descriptors in general, here:
http://kaldi-asr.org/doc/dnn3_code_data_types.html
(search for "Descriptors in config files")

Alice Aria

Sep 9, 2017, 10:53:34 AM
to kaldi-help
Hi, Dan. I've read the paragraph about descriptors, but I'm still confused. For example, I've seen that nnet3 supports convolution, but how should I write the config for it with descriptors? And what about a bottleneck?

David Snyder

Sep 9, 2017, 12:18:04 PM
to kaldi-help
Hi Alice,

It might help to look around in the tuning directories for examples of the nnet3 xconfig options you want to use. For example, this script https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/local/chain/tuning/run_cnn_tdnn_1a.sh uses convolutional layers, and this script https://github.com/kaldi-asr/kaldi/blob/master/egs/babel_multilang/s5/local/nnet3/run_tdnn_multilingual.sh has a bottleneck.

Best,
David

Daniel Povey

Sep 9, 2017, 1:29:35 PM
to kaldi-help
Descriptors are just for attaching components together; convolution would be inside a component.

A bottleneck would just be a layer that's thinner than the others; then later on, to actually dump the bottleneck features, you'd change the definition of the output of the neural net to point to the bottleneck instead. E.g. something like

nnet3-copy --config='echo output-node name=output input=some_component_node|' final.mdl temp.mdl
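
In the xconfig itself, a bottleneck might look something like this (just a sketch; the layer names and dims are made up):

relu-batchnorm-layer name=tdnn3 dim=1024
relu-batchnorm-layer name=bottleneck dim=42
relu-batchnorm-layer name=tdnn4 dim=1024

and the nnet3-copy trick above would then point the output at the bottleneck's node.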

Dan

Alice Aria

Sep 10, 2017, 4:31:29 AM
to kaldi-help
  # Note: we hardcode in the CNN config that we are dealing with 32x32x3 color images.

  common="required-time-offsets=0 height-offsets=-1,0,1 num-filters-out=32"

  mkdir -p $dir/configs
  cat <<EOF > $dir/configs/network.xconfig
  input dim=96 name=input
  conv-relu-batchnorm-layer name=cnn1 height-in=32 height-out=32 time-offsets=-1,0,1 $common
  conv-relu-batchnorm-layer name=cnn2 height-in=32 height-out=32 time-offsets=-1,0,1 $common
  conv-relu-batchnorm-layer name=cnn3 height-in=32 height-out=32 time-offsets=-1,0,1 $common
  conv-relu-batchnorm-layer name=cnn4 height-in=32 height-out=16 time-offsets=-1,0,1 $common height-subsample-out=2
  conv-relu-batchnorm-layer name=cnn5 height-in=16 height-out=16 time-offsets=-2,0,2 $common
  conv-relu-batchnorm-layer name=cnn6 height-in=16 height-out=16 time-offsets=-2,0,2 $common
  conv-relu-batchnorm-layer name=cnn7 height-in=16 height-out=8  time-offsets=-2,0,2 $common height-subsample-out=2
  conv-relu-batchnorm-layer name=cnn8 height-in=8 height-out=8   time-offsets=-4,0,4 $common
  conv-relu-batchnorm-layer name=cnn9 height-in=8 height-out=8   time-offsets=-4,0,4 $common
  conv-relu-batchnorm-layer name=cnn10 height-in=8 height-out=4   time-offsets=-4,0,4 $common height-subsample-out=2
  conv-relu-batchnorm-layer name=cnn11 height-in=4 height-out=4   time-offsets=-8,0,8 $common
  conv-relu-batchnorm-layer name=cnn12 height-in=4 height-out=4   time-offsets=-8,0,8 $common
  relu-batchnorm-layer name=fully_connected1 input=Append(0,8,16,24) dim=128
  relu-batchnorm-layer name=fully_connected2 dim=256
  output-layer name=output dim=$num_targets
EOF

I'm also confused. The code above is part of the CIFAR example. I guess the input matrix is 32x3 (why 32x3? As far as I know, a CIFAR image is a 32x32x3 matrix), and `height-in` means the input `time`? So I guess:
  conv-relu-batchnorm-layer name=cnn1 height-in=32 height-out=32 time-offsets=-1,0,1 $common
  conv-relu-batchnorm-layer name=cnn2 height-in=32 height-out=32 time-offsets=-1,0,1 $common
  conv-relu-batchnorm-layer name=cnn3 height-in=32 height-out=32 time-offsets=-1,0,1 $common
  conv-relu-batchnorm-layer name=cnn4 height-in=32 height-out=16 time-offsets=-1,0,1 $common height-subsample-out=2
What do these four lines do? The top three lines do convolution over 3 channels, don't they? And is cnn4 a pooling layer? I guess `time-offsets` and `height-offsets` indicate the conv filter size, am I right? The `height-offsets` stay the same while the `time-offsets` become larger and larger. I can't get a clear picture of the whole structure.

Daniel Povey

Sep 10, 2017, 2:58:19 PM
to kaldi-help
> I'm also confused. The code above is part of the CIFAR example. I guess the input matrix is 32x3 (why 32x3? As far as I know, a CIFAR image is a 32x32x3 matrix),

Each vertical stripe of the image (32 pixels with 3 colors) becomes one row of the input matrix (you could think of this as one "frame"). So the input "feature dimension" is 32 x 3, but there would be 32 frames of that input for each input image.
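
To make the vectorization concrete (a sketch; I'm assuming here that the color/filter index varies fastest within a row):

# one 32x32x3 image -> an input matrix with 32 rows, one per vertical stripe:
#   row t = [ pix(t,h=0,c=0), pix(t,h=0,c=1), pix(t,h=0,c=2),
#             pix(t,h=1,c=0), ..., pix(t,h=31,c=2) ]
# i.e. 32 (height) * 3 (colors) = 96 values per row, matching "input dim=96".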

> and `height-in` means the input `time`? So I guess:
> conv-relu-batchnorm-layer name=cnn1 height-in=32 height-out=32 time-offsets=-1,0,1 $common
> conv-relu-batchnorm-layer name=cnn2 height-in=32 height-out=32 time-offsets=-1,0,1 $common
> conv-relu-batchnorm-layer name=cnn3 height-in=32 height-out=32 time-offsets=-1,0,1 $common
> conv-relu-batchnorm-layer name=cnn4 height-in=32 height-out=16 time-offsets=-1,0,1 $common height-subsample-out=2
> What do these four lines do? The top three lines do convolution over 3 channels, don't they?

(for reference: common="required-time-offsets=0 height-offsets=-1,0,1 num-filters-out=32").

The first line expands to:

conv-relu-batchnorm-layer name=cnn1 height-in=32 height-out=32 time-offsets=-1,0,1 required-time-offsets=0 height-offsets=-1,0,1 num-filters-out=32

This does convolution on an input image with height 32, producing an output image with height 32. The available width at the output will be the same as the input width, because of "required-time-offsets=0" (i.e. it doesn't require left and right context; it pads with zeros); in practice both the input and output width will be 32. The num-filters-in is implicit; it's worked out from the input dimension, and in fact it is 3. The num-filters-out is 32. So the "feature dimension" at the output is 32 (height) * 32 (num-filters) = 1024.

"time-offsets=-1,0,1" and "height-offsets=-1,0,1" means that it's a
3x3 filter that has no "gaps". (the framework is actually more
general than regular convolution but you normally won't have gaps).
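
(So, for example, cnn5 above, with time-offsets=-2,0,2 but height-offsets=-1,0,1, is still a 3x3 filter, just with a gap of one frame between its time taps, i.e. dilated along the time axis.)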

The next 3 lines are self-explanatory given the above. The last line is:

conv-relu-batchnorm-layer name=cnn4 height-in=32 height-out=16 time-offsets=-1,0,1 height-subsample-out=2 required-time-offsets=0 height-offsets=-1,0,1 num-filters-out=32

which produces an output of height 16, because it subsamples every 2 frames of the input. In fact, we will also produce only every 2 "t" values at the output, but this is not specified here; it's implicit in later layers. In nnet3 you don't specify which frames you want it to compute: you specify which frames you need as dependencies from later layers, and it works out which frames it has to compute. So in that particular sense it's more declarative (less imperative) than the standard frameworks.
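
If it helps, here is the shape bookkeeping for the whole stack (heights from the config, 32 filters throughout; feature dim = height * num-filters):

# input:        height 32,  3 filters -> dim   96
# cnn1..cnn3:   height 32, 32 filters -> dim 1024
# cnn4..cnn6:   height 16, 32 filters -> dim  512
# cnn7..cnn9:   height  8, 32 filters -> dim  256
# cnn10..cnn12: height  4, 32 filters -> dim  128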

Dan

Daniel Povey

Sep 10, 2017, 3:00:01 PM
to kaldi-help
Correction, sorry: I meant that it subsamples vertically. So heights 0, 2, 4, ..., 30 would be retained. This has nothing to do with frames.
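
As a sanity check on the arithmetic:

# cnn4: height-in=32, height-subsample-out=2
#   -> retained input heights 0, 2, 4, ..., 30 (16 of them) -> height-out=16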

Alice Aria

Sep 10, 2017, 10:16:10 PM
to kaldi-help

Emmmm, I think I still don't understand.
Does `time-offsets=-1,0,1` mean that cnn1 does convolution with size 3 along the time axis? You say `required-time-offsets=0` indicates the amount of context, so is this option meant for TDNNs rather than standard CNNs?
How does it do convolution along "height"? What input is sent to cnn1: the whole 32x96 matrix, or 32 (time) x 32 (the left 32 columns)? Do cnn1, cnn2 and cnn3 do the same operation? Is their topology cnn1->cnn2->cnn3 or (cnn1+cnn2+cnn3)->cnn4?
The structure looks so different from other configs that, from the "layer" perspective, I can't picture each layer's input and output from the config.

Daniel Povey

Sep 10, 2017, 10:33:31 PM
to kaldi-help
There are many more details in the comments in the header nnet-convolutional-component.h.
It's cnn1->cnn2->cnn3->cnn4.
It's all 3x3 convolutions.


Alice Aria

Sep 10, 2017, 10:42:43 PM
to kaldi-help
I'm sorry that I hadn't checked the source code; I'll take a closer look.
Thanks for your explanation. :)