I have been studying the pitch extraction method used in Kaldi, specifically the piped command `compute-kaldi-pitch-feats ... | process-kaldi-pitch-feats ...`.
With `--add-raw-log-pitch` enabled, I visualized the four-dimensional pitch features produced by this command as follows. The mel spectrogram is shown for reference. I understand the other three dimensions well, but the POV feature extracted here seems to contradict its name, "probability of voicing". The spectrogram contains several positions where the sound is an entirely unvoiced phone (e.g. frame 60 for 's', frame 220 for 'dg'), yet the POV value is high there.
I may be new to speech processing, but shouldn't POV be high where there is a voiced vowel? I read the corresponding paper, "A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition" by D. Povey et al., but it only deepened my confusion.
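For anyone who wants to reproduce the observation, here is a minimal sketch in Python of how the POV column can be pulled out for plotting. It assumes the pipeline output was written in Kaldi's text archive format (an `ark,t:` wspecifier) and that the four columns come out in the order POV, normalized log pitch, delta pitch, raw log pitch. I have not confirmed that ordering, so treat the index constants as an assumption. The function and constant names are my own, and the sample numbers are made up purely to exercise the parser.

```python
from typing import Dict, List

# ASSUMPTION: column order when all four outputs are enabled.
# These constant names are hypothetical, not part of Kaldi.
POV, NORM_LOG_PITCH, DELTA_PITCH, RAW_LOG_PITCH = range(4)


def parse_text_ark(text: str) -> Dict[str, List[List[float]]]:
    """Parse Kaldi text-format matrices of the form 'utt-id  [ rows... ]'."""
    feats: Dict[str, List[List[float]]] = {}
    utt_id = None
    rows: List[List[float]] = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if utt_id is None:
            # Header line: the utterance id followed by an opening bracket.
            utt_id = tokens[0]
            if len(tokens) < 2 or tokens[1] != "[":
                raise ValueError(f"expected '[' after utterance id {utt_id!r}")
            tokens = tokens[2:]
        # The last row of a matrix carries the closing bracket.
        closing = bool(tokens) and tokens[-1] == "]"
        if closing:
            tokens = tokens[:-1]
        if tokens:
            rows.append([float(t) for t in tokens])
        if closing:
            feats[utt_id] = rows
            utt_id, rows = None, []
    return feats


def column(matrix: List[List[float]], index: int) -> List[float]:
    """Return one feature dimension as a per-frame list."""
    return [row[index] for row in matrix]


# Made-up numbers, only to exercise the parser; NOT real Kaldi output.
SAMPLE = """utt1  [
  0.10 -0.20 0.01 5.0
  -0.30 0.40 -0.02 5.1 ]
"""

feats = parse_text_ark(SAMPLE)
pov = column(feats["utt1"], POV)
print(pov)
```

With real output, replace `SAMPLE` with the contents of the text archive and plot `pov` against the frame index next to the spectrogram.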
Thanks in advance!