I noticed the CNN-TDNN chain model (obtained using the "local/chain/tuning/run_cnn_tdnn_1a.sh" script) has two output blocks, "prefinal-chain" and "prefinal-xent", both of them taking the same input component, "prefinal-l". The first ends with an affine layer, while the second has a softmax as its final layer. In an older post here, I found that there are two output branches, one for training (prefinal-xent) and one for decoding (prefinal-chain).
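For reference, the part of the network config I am asking about looks roughly like this (a sketch from memory of the xconfig lines in the standard chain recipes; the exact dims and options in run_cnn_tdnn_1a.sh may differ):

```
# shared bottleneck feeding both branches
linear-component name=prefinal-l dim=192

# chain branch: no softmax on the output (include-log-softmax=false)
prefinal name=prefinal-chain input=prefinal-l big-dim=1536 small-dim=192
output-layer name=output include-log-softmax=false dim=$num_targets

# cross-entropy branch: ends in (log-)softmax
prefinal name=prefinal-xent input=prefinal-l big-dim=1536 small-dim=192
output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor
```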
Can you elaborate a bit on how they are used?
Why are two branches required?
It is a bit unclear to me how prefinal-chain, which does not end with a softmax layer, can be used in the decoding step. I thought decoding was supposed to use posterior probabilities for the acoustic states, and those are provided by the softmax.