The ark,t: specifier can be used to get the result in text form.
The following code segment does the job:
# Get ids for pure phones instead of canonical ones
local/remove_phone_markers.pl $lang_dir/phones.txt \
data/phones-pure.txt data/phone-to-pure-phone.int
# Get the alignment (frame-to-transition-id mapping)
steps/nnet2/align.sh --nj 1 ./data $lang_dir $mdl_path exp/nnet_ali
# Get the frame-to-canonical-phone mapping
$cmd JOB=1:$nj exp/ali_test/log/ali_to_phones.JOB.log \
ali-to-phones --per-frame=true $mdl_path/final.mdl \
"ark,t:gunzip -c exp/nnet_ali/ali.JOB.gz|" \
"ark,t:|gzip -c >exp/nnet_ali/ali-phone.JOB.gz"
# Get the posteriors (a matrix with one row per frame; each column holds a senone emission probability)
nnet-am-compute $mdl_path/final.mdl "ark,s,cs:copy-feats scp:data/feats.scp ark:- |" ark:data/posteriors.ark
# Run compute-gop; with JOB=1 the GoP result is written to exp/gop/gop.1.txt
$cmd JOB=1:$nj exp/gop_test/log/compute_gop.JOB.log \
compute-gop --phone-map=data/phone-to-pure-phone.int $mdl_path/final.mdl \
"ark,t:gunzip -c exp/nnet_ali/ali-phone.JOB.gz|" \
"ark:data/posteriors.ark" \
"ark,t:exp/gop/gop.JOB.txt" "ark,t:exp/gop/phonefeat.JOB.txt"
Effectively, compute-gop computes the difference between the score from the
forced-alignment phase and the score from the free-recognition phase. The
'max' in the denominator of the GoP formula corresponds to free recognition,
where the model simply picks whichever state sequence it considers best.
So the closer the GoP value is to zero, the better the pronunciation.
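
For reference, the description above matches the classical log-likelihood-ratio form of GoP, sketched below. This is an illustration of the general definition, not necessarily the exact normalization that compute-gop applies. The symbols are notation introduced here, not taken from the tool: O^{(p)} is the set of frames aligned to the canonical phone p, N_p is the number of those frames, and Q is the phone set.

```latex
\mathrm{GOP}(p)
  = \frac{1}{N_p}\,
    \log\frac{P\big(O^{(p)} \mid p\big)}
             {\max_{q \in Q} P\big(O^{(p)} \mid q\big)}
  = \frac{1}{N_p}\Big[
      \underbrace{\log P\big(O^{(p)} \mid p\big)}_{\text{forced alignment}}
      \;-\;
      \underbrace{\max_{q \in Q}\,\log P\big(O^{(p)} \mid q\big)}_{\text{free recognition}}
    \Big]
  \;\le\; 0
```

Since p is itself a member of Q, the numerator can never exceed the denominator, so the value is at most zero. It reaches zero exactly when the canonical phone is also the best-scoring phone over those frames.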