> I also feel confused. The code above is part of cifar example. I guess the
> input matrix is 32x3 (Why 32x3? As I know, cifar is 32x32x3 matrix.),
Each vertical stripe of the image (32 pixels with 3 colors) becomes
one row of the input matrix (you could think of this as one "frame").
So the input "feature dimension" is 32 x 3 = 96, and there are 32
such frames for each input image.
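
To make the layout concrete, here is a rough Python sketch of that reshaping. The image[y][x][c] indexing and the ordering inside each frame (height-major, color varying fastest) are assumptions made for illustration, not something stated above; only the shapes (32 frames of dimension 96) come from the description.

```python
# Toy sketch: turn one CIFAR-style image into a sequence of "frames".
# Assumption: image[y][x][c] indexing and a height-major, channel-fastest
# ordering inside each frame; the real on-disk layout may differ.
HEIGHT, WIDTH, CHANNELS = 32, 32, 3


def image_to_frames(image):
    """Return one feature row per image column (one vertical stripe)."""
    frames = []
    for x in range(WIDTH):  # x plays the role of "time"
        row = [image[y][x][c] for y in range(HEIGHT) for c in range(CHANNELS)]
        frames.append(row)
    return frames


# Dummy image whose pixel values encode their own coordinates.
image = [[[(y, x, c) for c in range(CHANNELS)] for x in range(WIDTH)]
         for y in range(HEIGHT)]
frames = image_to_frames(image)
print(len(frames), len(frames[0]))  # number of frames, feature dimension
```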
> the
> `height-in` means that the input `time`? So I guess
> conv-relu-batchnorm-layer name=cnn1 height-in=32 height-out=32
> time-offsets=-1,0,1 $common
> conv-relu-batchnorm-layer name=cnn2 height-in=32 height-out=32
> time-offsets=-1,0,1 $common
> conv-relu-batchnorm-layer name=cnn3 height-in=32 height-out=32
> time-offsets=-1,0,1 $common
> conv-relu-batchnorm-layer name=cnn4 height-in=32 height-out=16
> time-offsets=-1,0,1 $common height-subsample-out=2
> What do these four lines do? The top three lines do convolution over 3
> channels, don't they?
(for reference: common="required-time-offsets=0 height-offsets=-1,0,1
num-filters-out=32").
The first line expands to:
conv-relu-batchnorm-layer name=cnn1 height-in=32 height-out=32
time-offsets=-1,0,1 required-time-offsets=0 height-offsets=-1,0,1
num-filters-out=32
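
The expansion is nothing more than variable substitution (presumably done by the shell in the script that writes the config). A Python stand-in for it, where string.Template is used purely for illustration and is not what the recipe itself uses:

```python
from string import Template

# Value of the variable as quoted above.
common = "required-time-offsets=0 height-offsets=-1,0,1 num-filters-out=32"

# The cnn1 line as it appears before substitution.
line = Template(
    "conv-relu-batchnorm-layer name=cnn1 height-in=32 height-out=32 "
    "time-offsets=-1,0,1 $common"
)
expanded = line.substitute(common=common)
print(expanded)
```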
This does convolution on an input image with height 32, producing an
output image with height 32. The available width at the output will
be the same as the input width because of "required-time-offsets=0"
(i.e. it doesn't require left and right context, it pads with zeros);
in practice both input and output width will be 32.
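
A toy model of the time axis alone may help here. It is a sketch under simplifying assumptions (one number per frame, made-up weights), not how nnet3 implements it; the point is only that treating out-of-range frames as zero gives one output per input frame.

```python
# Toy 1-D model of the time axis only; weights are made up for illustration.
TIME_OFFSETS = (-1, 0, 1)
WEIGHTS = {-1: 1.0, 0: 10.0, 1: 100.0}  # hypothetical weight per time offset


def conv_over_time(frames):
    """One output per input frame; out-of-range frames count as zero."""
    out = []
    for t in range(len(frames)):
        total = 0.0
        for offset in TIME_OFFSETS:
            src = t + offset
            value = frames[src] if 0 <= src < len(frames) else 0.0
            total += WEIGHTS[offset] * value
        out.append(total)
    return out


inputs = [1.0, 2.0, 3.0, 4.0]
outputs = conv_over_time(inputs)
print(len(inputs), len(outputs))  # same number of frames in and out
```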
The num-filters-in is implicit: it's worked out from the input
dimension, and in fact it is 3. The num-filters-out is 32. So the
"feature dimension" at the output is 32 (height) * 32 (num-filters)
= 1024. "time-offsets=-1,0,1" and "height-offsets=-1,0,1" mean that
it's a 3x3 filter that has no "gaps". (The framework is actually more
general than regular convolution, but you normally won't have gaps.)
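
The arithmetic for cnn1 can be checked with a few lines of Python. This is only a sketch of the bookkeeping; the variable names mirror the config options, and the weight count ignores any bias term.

```python
from itertools import product

time_offsets = (-1, 0, 1)
height_offsets = (-1, 0, 1)
num_filters_in = 3    # worked out from the input dimension: 96 / 32
num_filters_out = 32
height_in = 32
height_out = 32

# Every (time, height) offset pair is one tap of the filter.
taps = list(product(time_offsets, height_offsets))

input_dim = height_in * num_filters_in
output_dim = height_out * num_filters_out
# Weights per output filter, ignoring any bias term.
weights_per_filter = len(taps) * num_filters_in

print(len(taps), input_dim, output_dim, weights_per_filter)
```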
The next 2 lines (cnn2 and cnn3) are self-explanatory given the above.
The last line is:
conv-relu-batchnorm-layer name=cnn4 height-in=32 height-out=16
time-offsets=-1,0,1 height-subsample-out=2 required-time-offsets=0
height-offsets=-1,0,1 num-filters-out=32
which produces an output of height 16 because height-subsample-out=2
keeps only every second position along the height axis. In fact, we
will also produce only every second "t" value at the output, but this
is not specified here; it's implicit in later layers. In nnet3 you
don't specify what frames you want it to compute; you specify what
frames you need as dependencies from later layers, and it works out
what frames it has to compute. So in that particular sense it's more
declarative (less imperative) than the standard frameworks.
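
A toy illustration of that "work backwards from what is needed" idea follows. It is not nnet3's actual algorithm, and the request for every second frame of cnn4 is a hypothetical stand-in for whatever the later layers ask for.

```python
# Toy model of "work backwards from what is needed"; this is NOT nnet3's
# actual algorithm, and the choice of requested frames is hypothetical.
LAYERS = [  # (name, time_offsets), listed from input to output
    ("cnn1", (-1, 0, 1)),
    ("cnn2", (-1, 0, 1)),
    ("cnn3", (-1, 0, 1)),
    ("cnn4", (-1, 0, 1)),
]
NUM_INPUT_FRAMES = 32


def frames_to_compute(requested):
    """Map each layer name to the sorted output frames it must compute."""
    needed = {}
    wanted = set(requested)
    for name, offsets in reversed(LAYERS):
        needed[name] = sorted(wanted)
        # With required-time-offsets=0, only in-range inputs are used;
        # out-of-range ones are treated as zero padding.
        wanted = {t + o for t in wanted for o in offsets
                  if 0 <= t + o < NUM_INPUT_FRAMES}
    return needed


# A later layer asks for every second "t" value of cnn4's output.
needed = frames_to_compute(range(0, NUM_INPUT_FRAMES, 2))
for name, _ in LAYERS:
    print(name, len(needed[name]))
```

In this toy, cnn4 only has to compute the requested frames, while the layers below it still need every frame because of the -1 and +1 time offsets.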
Dan
> And cnn4 is one pooling layer? I guess the `time-offsets` and
> `height-offsets` indicate the conv filter size, am I right? The
> `height-offsets` keep invariant and `time-offsets` become larger and larger.
> I cannot have a direct view about the whole structure.
>
> On Sunday, September 10, 2017 at 1:29:35 AM UTC+8, Dan Povey wrote:
>>