nnet3-chain-train output not computable?


ma...@nyu.edu

Oct 31, 2020, 4:40:30 PM
to kaldi-help
In attempting to debug AM adaptation I am also trying to work from GMM as well as NNET alignments. I got a little further with this route (or so I thought) but now I see the message:

LOG (nnet3-chain-train[5.5.811~1-bcd163]:ExplainWhyAllOutputsNotComputable():nnet-computation-graph.cc:351) 200 output cindexes out of 200 were not computable.

I have no idea what this means. I am sure I am doing something stupid, but I don't know what. Could you at least give me some hints about where to look?

Thanks
Michael

Here is a bigger excerpt from the log:


nnet3-chain-train --use-gpu=yes --apply-deriv-weights=False --l2-regularize=5e-05 --leaky-hmm-coefficient=0.1 --write-cache=exp/chain/tdnn_librispeech_malach_1b/cache.1 --xent-regularize=0.1 --print-interval=10 --momentum=0.0 --max-param-change=1.41421356237 --backstitch-training-scale=0.0 --backstitch-training-interval=1 --l2-regularize-factor=1.0 --srand=0 'nnet3-am-copy --raw=true --learning-rate=0.005 --scale=1.0 exp/chain/tdnn_librispeech_malach_1b/0.mdl - |' exp/chain/tdnn_librispeech_malach_1b/den.fst 'ark,bg:nnet3-chain-copy-egs                          --frame-shift=1                         ark:exp/chain/tdnn_librispeech_malach_1b/egs/cegs.1.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=0 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=64 ark:- ark:- |' exp/chain/tdnn_librispeech_malach_1b/1.1.raw 
WARNING (nnet3-chain-train[5.5.811~1-bcd163]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (nnet3-chain-train[5.5.811~1-bcd163]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 1 GPUs
LOG (nnet3-chain-train[5.5.811~1-bcd163]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce RTX 2080 Ti free:10852M, used:167M, total:11019M, free/total:0.984828
LOG (nnet3-chain-train[5.5.811~1-bcd163]:SelectGpuIdAuto():cu-device.cc:471) Device: 0, mem_ratio: 0.984828
LOG (nnet3-chain-train[5.5.811~1-bcd163]:SelectGpuId():cu-device.cc:352) Trying to select device: 0
LOG (nnet3-chain-train[5.5.811~1-bcd163]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 0 free mem ratio: 0.984828
LOG (nnet3-chain-train[5.5.811~1-bcd163]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [0]: GeForce RTX 2080 Ti free:10610M, used:409M, total:11019M, free/total:0.962867 version 7.5
nnet3-am-copy --raw=true --learning-rate=0.005 --scale=1.0 exp/chain/tdnn_librispeech_malach_1b/0.mdl - 
LOG (nnet3-am-copy[5.5.811~1-bcd163]:main():nnet3-am-copy.cc:153) Copied neural net from exp/chain/tdnn_librispeech_malach_1b/0.mdl to raw format as -
nnet3-chain-shuffle-egs --buffer-size=5000 --srand=0 ark:- ark:- 
nnet3-chain-merge-egs --minibatch-size=64 ark:- ark:- 
nnet3-chain-copy-egs --frame-shift=1 ark:exp/chain/tdnn_librispeech_malach_1b/egs/cegs.1.ark ark:- 
LOG (nnet3-chain-train[5.5.811~1-bcd163]:ExplainWhyAllOutputsNotComputable():nnet-computation-graph.cc:351) 200 output cindexes out of 200 were not computable.
LOG (nnet3-chain-train[5.5.811~1-bcd163]:ExplainWhyAllOutputsNotComputable():nnet-computation-graph.cc:355) Computation request was:  # Computation request:
input-0: name=input, has-deriv=false, indexes=[(0,-40:191), (1,-40:191)]
output-0: name=output, has-deriv=true, indexes=[(0,0), (1,0), (0,3), (1,3), (0,6), (1,6), (0,9), (1,9), (0,12), (1,12), (0,15), (1,15), (0,18), (1, ... , (1,132), (0,135), (1,135), (0,138), (1,138), (0,141), (1,141), (0,144), (1,144), (0,147), (1,147)]
output-1: name=output-xent, has-deriv=true, indexes=[(0,0), (1,0), (0,3), (1,3), (0,6), (1,6), (0,9), (1,9), (0,12), (1,12), (0,15), (1,15), (0,18), (1, ... , (1,132), (0,135), (1,135), (0,138), (1,138), (0,141), (1,141), (0,144), (1,144), (0,147), (1,147)]
need-model-derivative: true
store-component-stats: true

LOG (nnet3-chain-train[5.5.811~1-bcd163]:ExplainWhyAllOutputsNotComputable():nnet-computation-graph.cc:357) Printing the reasons for 10 of these.
LOG (nnet3-chain-train[5.5.811~1-bcd163]:ExplainWhyNotComputable():nnet-computation-graph.cc:172) *** cindex output(0, 0, 0) is not computable for the following reason: ***
output(0, 0, 0) is kNotComputable, dependencies: output.affine(0, 0, 0)[kNotComputable], 
output.affine(0, 0, 0) is kNotComputable, dependencies: output.affine_input(0, 0, 0)[kNotComputable], 
output.affine_input(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.batchnorm2(0, 0, 0)[kNotComputable], 
prefinal-chain.batchnorm2(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.batchnorm2_input(0, 0, 0)[kNotComputable], 
prefinal-chain.batchnorm2_input(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.linear(0, 0, 0)[kNotComputable], 
prefinal-chain.linear(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.linear_input(0, 0, 0)[kNotComputable], 
prefinal-chain.linear_input(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.batchnorm1(0, 0, 0)[kNotComputable], 
prefinal-chain.batchnorm1(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.batchnorm1_input(0, 0, 0)[kNotComputable], 
prefinal-chain.batchnorm1_input(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.relu(0, 0, 0)[kNotComputable], 
prefinal-chain.relu(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.relu_input(0, 0, 0)[kNotComputable], 
prefinal-chain.relu_input(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.affine(0, 0, 0)[kNotComputable], 
prefinal-chain.affine(0, 0, 0) is kNotComputable, dependencies: prefinal-chain.affine_input(0, 0, 0)[kNotComputable], 
prefinal-chain.affine_input(0, 0, 0) is kNotComputable, dependencies: prefinal-l(0, 0, 0)[kNotComputable], 
prefinal-l(0, 0, 0) is kNotComputable, dependencies: prefinal-l_input(0, 0, 0)[kNotComputable], 
prefinal-l_input(0, 0, 0) is kNotComputable, dependencies: tdnnf17.noop(0, 0, 0)[kNotComputable], 
tdnnf17.noop(0, 0, 0) is kNotComputable, dependencies: tdnnf17.noop_input(0, 0, 0)[kNotComputable], 
tdnnf17.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf16.noop(0, 0, 0)[kNotComputable]tdnnf17.dropout(0, 0, 0)[kUnknown], 
tdnnf16.noop(0, 0, 0) is kNotComputable, dependencies: tdnnf16.noop_input(0, 0, 0)[kNotComputable], 
tdnnf17.dropout(0, 0, 0) is kUnknown, dependencies: tdnnf17.dropout_input(0, 0, 0)[kUnknown], 
tdnnf16.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf15.noop(0, 0, 0)[kNotComputable]tdnnf16.dropout(0, 0, 0)[kUnknown], 
tdnnf17.dropout_input(0, 0, 0) is kUnknown, dependencies: tdnnf17.batchnorm(0, 0, 0)[kUnknown], 
tdnnf15.noop(0, 0, 0) is kNotComputable, dependencies: tdnnf15.noop_input(0, 0, 0)[kNotComputable], 
tdnnf16.dropout(0, 0, 0) is kUnknown, dependencies: tdnnf16.dropout_input(0, 0, 0)[kUnknown], 
tdnnf17.batchnorm(0, 0, 0) is kUnknown, dependencies: tdnnf17.batchnorm_input(0, 0, 0)[kUnknown], 
tdnnf15.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf14.noop(0, 0, 0)[kNotComputable]tdnnf15.dropout(0, 0, 0)[kUnknown], 
tdnnf16.dropout_input(0, 0, 0) is kUnknown, dependencies: tdnnf16.batchnorm(0, 0, 0)[kUnknown], 
tdnnf17.batchnorm_input(0, 0, 0) is kUnknown, dependencies: tdnnf17.relu(0, 0, 0)[kUnknown], 
tdnnf14.noop(0, 0, 0) is kNotComputable, dependencies: tdnnf14.noop_input(0, 0, 0)[kNotComputable], 
tdnnf15.dropout(0, 0, 0) is kUnknown, dependencies: tdnnf15.dropout_input(0, 0, 0)[kUnknown], 
tdnnf16.batchnorm(0, 0, 0) is kUnknown, dependencies: tdnnf16.batchnorm_input(0, 0, 0)[kUnknown], 
tdnnf17.relu(0, 0, 0) is kUnknown, dependencies: tdnnf17.relu_input(0, 0, 0)[kUnknown], 
tdnnf14.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf13.noop(0, 0, 0)[kNotComputable]tdnnf14.dropout(0, 0, 0)[kUnknown], 
tdnnf15.dropout_input(0, 0, 0) is kUnknown, dependencies: tdnnf15.batchnorm(0, 0, 0)[kUnknown], 
tdnnf16.batchnorm_input(0, 0, 0) is kUnknown, dependencies: tdnnf16.relu(0, 0, 0)[kUnknown], 
tdnnf17.relu_input(0, 0, 0) is kUnknown, dependencies: tdnnf17.affine(0, 0, 0)[kUnknown], 
tdnnf13.noop(0, 0, 0) is kNotComputable, dependencies: tdnnf13.noop_input(0, 0, 0)[kNotComputable], 
tdnnf14.dropout(0, 0, 0) is kUnknown, dependencies: tdnnf14.dropout_input(0, 0, 0)[kUnknown], 
tdnnf15.batchnorm(0, 0, 0) is kUnknown, dependencies: tdnnf15.batchnorm_input(0, 0, 0)[kUnknown], 
tdnnf16.relu(0, 0, 0) is kUnknown, dependencies: tdnnf16.relu_input(0, 0, 0)[kUnknown], 
tdnnf17.affine(0, 0, 0) is kUnknown, dependencies: tdnnf17.affine_input(0, 0, 0)[kUnknown]tdnnf17.affine_input(0, 3, 0)[kUnknown], 
tdnnf13.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf12.noop(0, 0, 0)[kNotComputable]tdnnf13.dropout(0, 0, 0)[kUnknown], 
tdnnf14.dropout_input(0, 0, 0) is kUnknown, dependencies: tdnnf14.batchnorm(0, 0, 0)[kUnknown], 
tdnnf15.batchnorm_input(0, 0, 0) is kUnknown, dependencies: tdnnf15.relu(0, 0, 0)[kUnknown], 
tdnnf16.relu_input(0, 0, 0) is kUnknown, dependencies: tdnnf16.affine(0, 0, 0)[kUnknown], 
tdnnf17.affine_input(0, 0, 0) is kUnknown, dependencies: tdnnf17.linear(0, 0, 0)[kUnknown], 
tdnnf17.affine_input(0, 3, 0) is kUnknown, dependencies: tdnnf17.linear(0, 3, 0)[kUnknown], 
tdnnf12.noop(0, 0, 0) is kNotComputable, dependencies: tdnnf12.noop_input(0, 0, 0)[kNotComputable], 
tdnnf13.dropout(0, 0, 0) is kUnknown, dependencies: tdnnf13.dropout_input(0, 0, 0)[kUnknown], 
tdnnf14.batchnorm(0, 0, 0) is kUnknown, dependencies: tdnnf14.batchnorm_input(0, 0, 0)[kUnknown], 
tdnnf15.relu(0, 0, 0) is kUnknown, dependencies: tdnnf15.relu_input(0, 0, 0)[kUnknown], 
tdnnf16.affine(0, 0, 0) is kUnknown, dependencies: tdnnf16.affine_input(0, 0, 0)[kNotComputable]tdnnf16.affine_input(0, 3, 0)[kUnknown], 
tdnnf17.linear(0, 0, 0) is kUnknown, dependencies: tdnnf17.linear_input(0, -3, 0)[kUnknown]tdnnf17.linear_input(0, 0, 0)[kNotComputable], 
tdnnf17.linear(0, 3, 0) is kUnknown, dependencies: tdnnf17.linear_input(0, 0, 0)[kNotComputable]tdnnf17.linear_input(0, 3, 0)[kNotComputable], 
tdnnf12.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf11.noop(0, 0, 0)[kNotComputable]tdnnf12.dropout(0, 0, 0)[kUnknown], 
tdnnf13.dropout_input(0, 0, 0) is kUnknown, dependencies: tdnnf13.batchnorm(0, 0, 0)[kUnknown], 
tdnnf14.batchnorm_input(0, 0, 0) is kUnknown, dependencies: tdnnf14.relu(0, 0, 0)[kUnknown], 
tdnnf15.relu_input(0, 0, 0) is kUnknown, dependencies: tdnnf15.affine(0, 0, 0)[kUnknown], 
tdnnf16.affine_input(0, 0, 0) is kNotComputable, dependencies: tdnnf16.linear(0, 0, 0)[kNotComputable], 
tdnnf16.affine_input(0, 3, 0) is kUnknown, dependencies: tdnnf16.linear(0, 3, 0)[kUnknown], 
tdnnf17.linear_input(0, -3, 0) is kUnknown, dependencies: tdnnf16.noop(0, -3, 0)[kUnknown], 
tdnnf17.linear_input(0, 0, 0) is kNotComputable, dependencies: tdnnf16.noop(0, 0, 0)[kNotComputable], 
tdnnf17.linear_input(0, 3, 0) is kNotComputable, dependencies: tdnnf16.noop(0, 3, 0)[kNotComputable], 
tdnnf11.noop(0, 0, 0) is kNotComputable, dependencies: tdnnf11.noop_input(0, 0, 0)[kNotComputable], 
tdnnf12.dropout(0, 0, 0) is kUnknown, dependencies: tdnnf12.dropout_input(0, 0, 0)[kUnknown], 
tdnnf13.batchnorm(0, 0, 0) is kUnknown, dependencies: tdnnf13.batchnorm_input(0, 0, 0)[kUnknown], 
tdnnf14.relu(0, 0, 0) is kUnknown, dependencies: tdnnf14.relu_input(0, 0, 0)[kUnknown], 
tdnnf15.affine(0, 0, 0) is kUnknown, dependencies: tdnnf15.affine_input(0, 0, 0)[kNotComputable]tdnnf15.affine_input(0, 3, 0)[kUnknown], 
tdnnf16.linear(0, 0, 0) is kNotComputable, dependencies: tdnnf16.linear_input(0, -3, 0)[kUnknown]tdnnf16.linear_input(0, 0, 0)[kNotComputable], 
tdnnf16.linear(0, 3, 0) is kUnknown, dependencies: tdnnf16.linear_input(0, 0, 0)[kNotComputable]tdnnf16.linear_input(0, 3, 0)[kNotComputable], 
tdnnf16.noop(0, -3, 0) is kUnknown, dependencies: tdnnf16.noop_input(0, -3, 0)[kUnknown], 
tdnnf16.noop(0, 3, 0) is kNotComputable, dependencies: tdnnf16.noop_input(0, 3, 0)[kNotComputable], 
tdnnf11.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf10.noop(0, 0, 0)[kNotComputable]tdnnf11.dropout(0, 0, 0)[kUnknown], 
tdnnf12.dropout_input(0, 0, 0) is kUnknown, dependencies: tdnnf12.batchnorm(0, 0, 0)[kUnknown], 
tdnnf13.batchnorm_input(0, 0, 0) is kUnknown, dependencies: tdnnf13.relu(0, 0, 0)[kUnknown], 
tdnnf14.relu_input(0, 0, 0) is kUnknown, dependencies: tdnnf14.affine(0, 0, 0)[kUnknown], 
tdnnf15.affine_input(0, 0, 0) is kNotComputable, dependencies: tdnnf15.linear(0, 0, 0)[kNotComputable], 
tdnnf15.affine_input(0, 3, 0) is kUnknown, dependencies: tdnnf15.linear(0, 3, 0)[kUnknown], 
tdnnf16.linear_input(0, -3, 0) is kUnknown, dependencies: tdnnf15.noop(0, -3, 0)[kUnknown], 
tdnnf16.linear_input(0, 0, 0) is kNotComputable, dependencies: tdnnf15.noop(0, 0, 0)[kNotComputable], 
tdnnf16.linear_input(0, 3, 0) is kNotComputable, dependencies: tdnnf15.noop(0, 3, 0)[kNotComputable], 
tdnnf16.noop_input(0, -3, 0) is kUnknown, dependencies: tdnnf15.noop(0, -3, 0)[kUnknown]tdnnf16.dropout(0, -3, 0)[kUnknown], 
tdnnf16.noop_input(0, 3, 0) is kNotComputable, dependencies: tdnnf15.noop(0, 3, 0)[kNotComputable]tdnnf16.dropout(0, 3, 0)[kUnknown], 
tdnnf10.noop(0, 0, 0) is kNotComputable, dependencies: tdnnf10.noop_input(0, 0, 0)[kNotComputable], 
tdnnf11.dropout(0, 0, 0) is kUnknown, dependencies: tdnnf11.dropout_input(0, 0, 0)[kUnknown], 
tdnnf12.batchnorm(0, 0, 0) is kUnknown, dependencies: tdnnf12.batchnorm_input(0, 0, 0)[kUnknown], 
tdnnf13.relu(0, 0, 0) is kUnknown, dependencies: tdnnf13.relu_input(0, 0, 0)[kUnknown], 
tdnnf14.affine(0, 0, 0) is kUnknown, dependencies: tdnnf14.affine_input(0, 0, 0)[kNotComputable]tdnnf14.affine_input(0, 3, 0)[kUnknown], 
tdnnf15.linear(0, 0, 0) is kNotComputable, dependencies: tdnnf15.linear_input(0, -3, 0)[kUnknown]tdnnf15.linear_input(0, 0, 0)[kNotComputable], 
tdnnf15.linear(0, 3, 0) is kUnknown, dependencies: tdnnf15.linear_input(0, 0, 0)[kNotComputable]tdnnf15.linear_input(0, 3, 0)[kNotComputable], 
tdnnf15.noop(0, -3, 0) is kUnknown, dependencies: tdnnf15.noop_input(0, -3, 0)[kUnknown], 
tdnnf15.noop(0, 3, 0) is kNotComputable, dependencies: tdnnf15.noop_input(0, 3, 0)[kNotComputable], 
tdnnf16.dropout(0, -3, 0) is kUnknown, dependencies: tdnnf16.dropout_input(0, -3, 0)[kUnknown], 
tdnnf16.dropout(0, 3, 0) is kUnknown, dependencies: tdnnf16.dropout_input(0, 3, 0)[kUnknown], 
tdnnf10.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf9.noop(0, 0, 0)[kNotComputable]tdnnf10.dropout(0, 0, 0)[kUnknown], 
tdnnf11.dropout_input(0, 0, 0) is kUnknown, dependencies: tdnnf11.batchnorm(0, 0, 0)[kUnknown], 
tdnnf12.batchnorm_input(0, 0, 0) is kUnknown, dependencies: tdnnf12.relu(0, 0, 0)[kUnknown], 
tdnnf13.relu_input(0, 0, 0) is kUnknown, dependencies: tdnnf13.affine(0, 0, 0)[kUnknown], 
tdnnf14.affine_input(0, 0, 0) is kNotComputable, dependencies: tdnnf14.linear(0, 0, 0)[kNotComputable], 
tdnnf14.affine_input(0, 3, 0) is kUnknown, dependencies: tdnnf14.linear(0, 3, 0)[kUnknown], 
tdnnf15.linear_input(0, -3, 0) is kUnknown, dependencies: tdnnf14.noop(0, -3, 0)[kUnknown], 

Daniel Povey

Oct 31, 2020, 9:59:34 PM
to kaldi-help
I'd look further down the debug output for references to the nodes called "input" or "ivector".
Possibly no ivector input was supplied, although I think that might have given a different error.

One thing you could try is to set the batchnorm components to test mode before doing the retraining you tried before (but with a lower than normal learning rate).
It would be helpful to the rest of us if you discover that that makes a difference.
nnet3-copy doesn't have an option to do *just* that. It has one called "--prepare-for-test", but that one also calls CollapseModel(), which may not be what you want. You might have to modify the code to add an option to just set batchnorm test mode.

Dan



ma...@nyu.edu

Oct 31, 2020, 10:36:53 PM
to kaldi-help
I would be happy to try that but I don't even understand what the message "is not computable" means (or any of the other output). Does non-computable mean the neural net topology is broken in some fashion? Or the input to it is broken? Or the parameters for training it are way off so it can't compute any derivatives?  It seems to fail immediately - can't get through even one minibatch.  The model (before fine tuning) seems to decode ok. 

Thanks
Michael

Daniel Povey

Oct 31, 2020, 11:07:49 PM
to kaldi-help
There's a dependency graph. The message means that either the inputs required to compute the outputs were not available, or the graph had a loop (unlikely if you previously used it OK).

Kirill 'kkm' Katsnelson

Nov 1, 2020, 7:33:13 AM
to kaldi-help
Michael, this error has nothing to do with trainable parameters or gradients. The error happens much earlier, when the graph is compiled. We delay this until the computation device is known. Perhaps it would make sense to precompile the network with "placeholder" dimensions, as TensorFlow does for example, and instantiate a concrete network when the values of the placeholders are known, but we just don't do that. The whole compilation is performed separately for every minibatch size (there is not much overhead, as the compiled computations are cached).

The network definition is normally located in exp/EXPT_NAME/chain/configs or exp/EXPT_NAME/nnet3/configs. You can find it with 'find  exp/EXPT_NAME -type d -name configs'

Your model is copied to the file network.config, then processed in multiple passes where intermediate files are generated, in order: xconfig, xconfig.expanded.1, xconfig.expanded.2, final.config, raw.config, and, finally, the binary file ref.raw. The latter is authoritative. If we really have a bug, you may want to save it to a text file with  'nnet3-info ref.raw >raw.authoritative' (only the network is printed, not parameters, so the output is much smaller than this binary file), or start with final.config, which sits close to the middle of the chain and is the easiest to read, and bisect toward one end or another, depending on your finding. Starting with final.config, the network is expanded into (not-very-)elementary graph node and edge descriptors, and every line of your descriptor may produce 1 to 4 or maybe 5 of these elementary nodes, implemented in C++ code.

Every node in the graph has its input connected with an edge to some output-like entity: either another node's output, or the input layer. An edge can transform data flowing through it. Basically, in your case some input is left unconnected. The names in the message refer to that expanded graph: name.type refers to a component and its output, name.type_input to its input.

The lines that begin with 'component' define layers, or graph nodes; 'component-node' and 'dim-range-node' are graph edges; and 'input-node' and 'output-node' are the network input and output nodes, with one end not connected in the graph: these are the input and the output(s) of the whole kaboodle. It's a misnomer that edges are called nodes in the definition file, but it is how it is. All components have name=, type=, input-dim= and output-dim= arguments, and possibly other parameters (constants or hyperparameters), depending on their type=.

All edges define a parameter called 'input=', which contains either a simple name or an expression in a special tensor transform language, consisting of nested functions but containing only names or constants at the lowest level of nesting. Each such name refers either to a component (and thus implicitly to a connection to its output), or to another edge, in which case it means the result of that edge's transformation (its input= expression). A simple 'input=SOME_NAME' is an identity transformation. 'dim-range-node' is a dimension-splitter edge, which takes a contiguous column range of its input, and has two additional arguments for the index offset and the extracted length. The main input node is conventionally named 'input' and takes no input from any other node; it represents the input of the whole graph.

Something is not right with this graph. I think you could sort the error trace alphabetically, and something will stand out. What I do not see in this trace is an input named "input" (not x.y_input, which is the input of a node x.y, but simply input, referring to the input node conventionally named 'input').

The compilation builds a graph that computes given outputs from the given input:

    Computation request was:  # Computation request:
    input-0: name=input ...
    output-0: name=output, ...
    output-1: name=output-xent, ...

but the name input does not appear in any dependencies listed in the error. That is not right.
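
As a rough aid for the sorting step suggested above, here is a minimal Python sketch that scans ExplainWhyNotComputable lines and reports whether a network input shows up anywhere, plus which nodes are mentioned as dependencies but never get a line of their own (a hint that the trace stops short). The name summarize_trace and the regular expressions are illustrative assumptions based on the lines quoted in this thread, not something shipped with Kaldi; node names are compared without their (n, t, x) indexes to keep it short.

```python
import re

# A cindex as printed in the trace, e.g. "tdnnf17.linear_input(0, -3, 0)".
CINDEX = r"([\w.\-]+)\((-?\d+), (-?\d+), (-?\d+)\)"
LINE_RE = re.compile(r"^" + CINDEX + r" is (k\w+), dependencies: (.*)$")
DEP_RE = re.compile(CINDEX + r"\[(k\w+)\]")


def summarize_trace(lines):
    """Return (explained, mentioned): node names that have their own line,
    and node names that occur as a dependency of some other node."""
    explained, mentioned = set(), set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip LOG headers and anything else that is not a trace line
        explained.add(m.group(1))
        for dep in DEP_RE.finditer(m.group(6)):
            mentioned.add(dep.group(1))
    return explained, mentioned


# Three lines copied from the trace at the top of the thread.
trace = """\
output(0, 0, 0) is kNotComputable, dependencies: output.affine(0, 0, 0)[kNotComputable],
tdnnf17.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf16.noop(0, 0, 0)[kNotComputable]tdnnf17.dropout(0, 0, 0)[kUnknown],
tdnnf17.linear_input(0, -3, 0) is kUnknown, dependencies: tdnnf16.noop(0, -3, 0)[kUnknown],
""".splitlines()

explained, mentioned = summarize_trace(trace)
seen = explained | mentioned
for name in ("input", "ivector"):
    print(name, "appears in trace:", name in seen)
print("mentioned but never explained:", sorted(mentioned - explained))
```

Run on these three sample lines, neither 'input' nor 'ivector' is found, which is the same symptom described above. Feeding it a whole log file instead of the sample is a matter of passing the file's lines.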

For example, in a final.config file I am looking at right now, there are declarations
  1. input-node name=input dim=40
  2. component-node name=tdnn1.affine component=tdnn1.affine input=Append(Offset(input, -1), input, Offset(input, 1))
  3. component name=tdnn1.affine type=NaturalGradientAffineComponent input-dim=120 output-dim=512
Line 2 declares an edge that uses the Append() function to transform three consecutive inputs of the network (the 40×3 shape) into a row-first flattened 120×1 vector, and connects the result to the input of the NaturalGradientAffineComponent instance named tdnn1.affine (120×1 -> 512×1). This component can be computed, and will need to be computed if its output is connected to an input of another component that both needs to be computed and can be computed. This recurrence ends at an output that needs to be computed because the computation request says so. Thus, the computation dependencies are evaluated backwards, starting at the output and finishing (if everything goes fine) at the input. This is needed to calculate the input shape. Think of the Append() above: it requires the input to have more rows than the output, i.e. it adds extra left and right context.
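
To make the reading of line 2 concrete, here is a small Python sketch that pulls the source names out of an input= descriptor and builds a map from each component-node to what it reads from. It assumes that input= is the last field on the line, as in the example above; the helper names descriptor_sources and edges are hypothetical, and this is not how Kaldi itself parses its config files.

```python
import re

# A name is an identifier (dots and hyphens allowed) that is NOT followed by "(",
# so function names such as Append and Offset are skipped, as are numeric constants.
NAME_RE = re.compile(r"\b[A-Za-z_][\w.\-]*\b(?!\s*\()")
NODE_RE = re.compile(r"\s*component-node\s+name=(\S+)\s+component=\S+\s+input=(.*)$")


def descriptor_sources(descriptor):
    """Names referenced by an input= descriptor, without duplicates."""
    return sorted(set(NAME_RE.findall(descriptor)))


def edges(config_text):
    """Map each component-node name to the names its input= descriptor reads."""
    result = {}
    for line in config_text.splitlines():
        m = NODE_RE.match(line)
        if m:
            result[m.group(1)] = descriptor_sources(m.group(2))
    return result


# The three declarations quoted above, without the list numbering.
config = """\
input-node name=input dim=40
component-node name=tdnn1.affine component=tdnn1.affine input=Append(Offset(input, -1), input, Offset(input, 1))
component name=tdnn1.affine type=NaturalGradientAffineComponent input-dim=120 output-dim=512
"""

print(edges(config))
```

Here the only edge found is the one into tdnn1.affine, and its only source is the network input; the three Offset() terms explain why input-dim is 3 × 40 = 120.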

Now, if I remove the second line, which declares the connection from the input to the component named tdnn1.affine, the component becomes uncomputable. And its computation is requested if any component connected to its output needs to be computed in the sense defined above. Note that if a component is uncomputable, all components depending on it also become uncomputable. You see a lot of uncomputable nodes and edges in the ExplainWhyNotComputable message, an indication that the problem is close to the input of the graph.

A Makefile can serve as an analogy: to compile (=compute) one or more outputs you need to compile (=compute) their dependencies, and in turn their dependencies, until the chain of dependencies traversing from the outputs to the inputs ends at existing source files (=network inputs). If the source files (the inputs) exist but there is no Makefile rule (=a graph edge), the computation fails.
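
The backward walk described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Kaldi's algorithm: it ignores the (n, t, x) indexes and time offsets, treats every dependency as required, and the function name computable and the dictionary layout are made up for illustration.

```python
def computable(node, deps, available, _visiting=None):
    """Return True if `node` can be computed from the `available` inputs.

    deps maps a node name to the names it reads from (its "Makefile rule").
    """
    if _visiting is None:
        _visiting = set()
    if node in available:      # an existing "source file" (a supplied input)
        return True
    if node not in deps:       # nothing says how to make it
        return False
    if node in _visiting:      # we came back to a node we are still resolving: a loop
        return False
    _visiting.add(node)
    ok = all(computable(d, deps, available, _visiting) for d in deps[node])
    _visiting.discard(node)
    return ok


full = {"output": ["tdnn1.affine"], "tdnn1.affine": ["input"]}
no_rule = {"output": ["tdnn1.affine"]}  # the component-node line was removed

print(computable("output", full, {"input"}))     # rule and input both present
print(computable("output", no_rule, {"input"}))  # input exists, rule is missing
print(computable("output", full, set()))         # rule exists, input not supplied
```

The last two calls fail for different reasons (a missing edge versus a missing input), yet from the output's point of view both look the same: everything downstream is not computable.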

 -kkm

ma...@nyu.edu

Nov 2, 2020, 9:19:01 AM
to kaldi-help
Thanks very much for the detailed explanation. Since this is an adaptation experiment, I am starting from a fully-trained librispeech model. If I look at train.py, it looks like it does not use the configuration file at all. I attach the output of nnet3-info for 0.mdl, it looks "good" to me (not that I know much, of course). 

Is it possible that the input (which I gather is the cegs files) is messed up in some fashion? Is there any other debugging info I can get the training program to print out that would be helpful?

Thanks
Michael
0.pdf

Kirill 'kkm' Katsnelson

Nov 2, 2020, 10:57:00 PM
to kaldi-help
Looks like you are missing the 'ivector' input in cegs. From the PDF you attached (plain text would have been easier, really):

input-node name=ivector dim=100
input-node name=input dim=40
...
output-node name=output
output-node name=output-xent ...

but the computation request seems to have no data for the input named 'ivector', only for 'input':

LOG (nnet3-chain-train[5.5.811~1-bcd163]:ExplainWhyAllOutputsNotComputable():nnet-computation-graph.cc:355) Computation request was:  # Computation request:
input-0: name=input, ...
output-0: name=output, ...
output-1: name=output-xent, ...
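
The comparison made above (input nodes declared by the model versus inputs present in the computation request) can also be done mechanically. The following Python sketch is only an illustration: the function names declared_inputs and requested_inputs are hypothetical, and the patterns assume the text looks like the nnet3-info lines and the "Computation request" lines quoted in this thread.

```python
import re


def declared_inputs(nnet_info_text):
    """Names of all input-node lines in nnet3-info style text."""
    return set(re.findall(r"^input-node name=(\S+)", nnet_info_text, flags=re.M))


def requested_inputs(request_text):
    """Names of the input-N entries in a printed computation request."""
    return set(re.findall(r"^input-\d+: name=([^,\s]+)", request_text, flags=re.M))


# Lines copied from the messages in this thread.
info = """\
input-node name=ivector dim=100
input-node name=input dim=40
"""
request = """\
input-0: name=input, has-deriv=false, indexes=[(0,-40:191), (1,-40:191)]
"""

missing = declared_inputs(info) - requested_inputs(request)
print("declared but not supplied:", sorted(missing))
```

Any name left in the difference is an input the network expects but the egs did not provide.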

Too bad the ExplainWhy... message is cut. Was it like this, or did you not post it whole? If Kaldi cut it, we certainly need to fix that: as it stands, it says nothing and is only confusing.

 -kkm

ma...@nyu.edu

Nov 3, 2020, 10:52:02 AM
to kaldi-help
Thank you so much!!!

Now it is training!!!! No more uncomputable messages!!!

What happened was I misspelled a shell variable; unfortunately, unlike python, bash is very happy with such things and passed a blank directory to train.py. 
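
For anyone who hits the same thing: one way to catch a blank directory early is to check the argument before any training starts. The snippet below is a minimal sketch of such a guard for a hypothetical Python wrapper; require_dir is a made-up helper and is not part of train.py.

```python
import os


def require_dir(path, what):
    """Return `path` if it names an existing directory, otherwise raise."""
    if not path or not path.strip():
        raise ValueError(f"{what} is empty (misspelled shell variable?)")
    if not os.path.isdir(path):
        raise FileNotFoundError(f"{what} is not a directory: {path}")
    return path


try:
    # An unset shell variable expands to an empty string.
    require_dir("", "directory argument")
except ValueError as err:
    print("refused:", err)
```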

Best
Michael

Kirill 'kkm' Katsnelson

Nov 4, 2020, 5:01:45 AM
to kaldi-help
Michael, glad it helped. Your case is not easy to reproduce quickly, so I need your help too. This message:

LOG (nnet3-chain-train[5.5.811~1-bcd163]:ExplainWhyNotComputable():nnet-computation-graph.cc:172) *** cindex output(0, 0, 0) is not computable for the following reason: ***
output(0, 0, 0) is kNotComputable, dependencies: output.affine(0, 0, 0)[kNotComputable],
<snip>
tdnnf14.affine_input(0, 3, 0) is kUnknown, dependencies: tdnnf14.linear(0, 3, 0)[kUnknown], 
tdnnf15.linear_input(0, -3, 0) is kUnknown, dependencies: tdnnf14.noop(0, -3, 0)[kUnknown], 

is clipped at the tail. Did you post a complete message output by Kaldi, or was it just clipped by your terminal or something? Because if it's all that Kaldi output, that's a bug that needs to be fixed.

A complete message would end with something like "component lda is not computable because input.ivector is not computable", and it would save us a lot of time. A snipped message is only misleading: you read it, then copy it to an editor, sort the lines, toposort in your head, and finally see the "root" cause: " tdnnf10.noop_input(0, 0, 0) is kNotComputable, dependencies: tdnnf9.noop(0, 0, 0)[kNotComputable]..." But why on Earth is tdnnf9.noop not computable??? Noop is the easiest computation of all: you do nothing, and it computes itself. You see, a chopped message is only confusing, while a full message is very helpful.

This is why I'm asking whether there ever was a full message, or Kaldi chopped it.

 -kkm

ma...@nyu.edu

Nov 5, 2020, 10:38:00 AM
to kaldi-help
The above message was repeated 10 times, with no message of the sort you describe. I just copied the first repetition to avoid flooding the post.

Best and thanks!
Michael
