Number of pitch frames after to_pitch

P “The Igeeit” Lam

Jul 2, 2023, 10:08:14 AM

to Parselmouth

Hello Yannick,

Thanks for making this library!

I notice that after calling sound.to_pitch, the number of frames in the result is not total_samples / hop_length

but rather

total_samples / hop_length - 3

When using very_accurate=True, I get

total_samples / hop_length - 6

It seems like this is related to window length: "Very accurate (standard value: off) -- if off, the window is a Hanning window with a physical length of 3 / (pitch floor). If on, the window is a Gaussian window with a physical length of 6 / (pitch floor), i.e. twice the effective length."

I am guessing the pitch extractor starts counting the first frame from 1.5 frames into the signal (for very_accurate=False).

I need to match the frame count to total_samples / hop_length to align with mel-spectrograms.

Currently I'm padding the signal with 1.5*hop_length zeros at both ends before extracting pitch. Is this solution correct?

Code:

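import numpy as np
import parselmouth

hop_length = 256

def check_f0_praat(audio, fs, very_accurate=False):
    # Zero-pad both ends so Praat returns roughly one pitch frame per hop
    multiple = 3.5 if very_accurate else 1.5
    padding = int(multiple * hop_length)
    x = np.pad(audio, (padding, padding))
    sound = parselmouth.Sound(x, fs)
    # Use the mel-spectrogram hop as Praat's time step (in seconds)
    time_step = hop_length / fs
    pitch = sound.to_pitch_ac(time_step=time_step, very_accurate=very_accurate)
    # Keep only the frequency of the selected candidate in each frame
    f0 = np.array([p[0] for p in pitch.selected_array])
    return f0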

Regards

Perry

yannick...@gmail.com

Jul 3, 2023, 9:45:11 AM

to Parselmouth

Hi Perry!

The reason for this is that Praat needs a full window before it can compute the first pitch estimate. So, if the window is 3 / (pitch floor) long, the middle of the first window is indeed at least 1.5 / (pitch floor) into the signal, and that's independent of the hop size.
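A rough sketch of that reasoning in plain Python. It assumes Praat fits as many whole time steps as possible into (duration - window length) and then centers the frames in the sound, and that the sound starts at time 0. The function name `praat_pitch_frames` is made up for illustration; the values from `pitch.xs()` remain the reference.

```python
import math

def praat_pitch_frames(duration, time_step, pitch_floor=75.0, very_accurate=False):
    """Estimate the pitch frame count and frame-center times (assumed layout)."""
    # Window length: 3 periods of the pitch floor, or 6 when very_accurate is on
    periods_per_window = 6.0 if very_accurate else 3.0
    window = periods_per_window / pitch_floor
    if window > duration:
        raise ValueError("sound is shorter than the analysis window")
    # Only positions where a full window fits inside the sound are counted
    n_frames = math.floor((duration - window) / time_step) + 1
    # Assumption: leftover time is split evenly between start and end
    first_time = 0.5 * duration - 0.5 * n_frames * time_step + 0.5 * time_step
    times = [first_time + k * time_step for k in range(n_frames)]
    return n_frames, times

fs, hop_length = 22050, 256
n_frames, times = praat_pitch_frames(1.0, hop_length / fs)
```

Under these assumptions, a 1 s sound at 22050 Hz with a hop of 256 samples gives 83 frames, three fewer than 22050 // 256 = 86, and 80 frames with very_accurate=True, which is in line with the offsets reported above.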

Seems you already found this, but here's Praat's documentation on the function: https://www.fon.hum.uva.nl/praat/manual/Sound__To_Pitch__ac____.html

The lack of control over the exact points at which the analyses are done is sometimes annoying, yes, especially when matching up different analyses. I'm afraid I don't have the exact solution here (and it's sort of out of my control, as I aim for Parselmouth to be a reasonably transparent Python interface to Praat), but I have a few suggestions/notes.

- I would say there are indeed 2 ways to match up the time steps. Either you sort-of reverse engineer Praat's calculation of time steps (happy to help, and point you to the right piece of code; I'll link below to the calculation for Pitch), or you interpolate between estimates (`pitch.get_value_at_time(...)` will do this for you, and even allow you to control interpolation: https://parselmouth.readthedocs.io/en/stable/api_reference.html#parselmouth.Pitch.get_value_at_time).

- Note that adding this zero-padding to the front and back has two caveats:

1) your first/last sliding window position(s) will be half-filled with zeros, so your estimates won't be as robust at the start and end;

2) you might get the same number of frames, but that doesn't mean that the actual times of these estimates match up perfectly. Compare `pitch.xs()` and `mel_spectrogram.xs()`, as they will give you the center of the sliding window for all frames (modulo the padding that you're adding, of course!). In this respect, the interpolation might be the safe choice as well, to be sure you always have the same time offsets for all frames.
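The interpolation route can also be sketched with NumPy alone, as an alternative to calling `pitch.get_value_at_time(...)` once per frame. The helper name `resample_f0` and the voicing rule (a target frame is unvoiced when its nearest pitch frame is unvoiced) are illustrative choices, not something Praat or Parselmouth prescribes.

```python
import numpy as np

def resample_f0(pitch_times, f0, target_times):
    """Resample an f0 track (0 = unvoiced) onto arbitrary frame times."""
    pitch_times = np.asarray(pitch_times, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    target_times = np.asarray(target_times, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return np.zeros_like(target_times)
    # Interpolate linearly through voiced frames only, so that zeros from
    # unvoiced frames do not drag the interpolated values down
    interpolated = np.interp(target_times, pitch_times[voiced], f0[voiced])
    # The nearest original frame decides whether a target frame is voiced
    distances = np.abs(target_times[:, None] - pitch_times[None, :])
    nearest = distances.argmin(axis=1)
    return np.where(voiced[nearest], interpolated, 0.0)

# Toy data: three pitch frames, the middle one unvoiced
f0_on_targets = resample_f0([0.02, 0.03, 0.04], [100.0, 0.0, 120.0],
                            [0.0, 0.022, 0.031, 0.04])
```

With Parselmouth, the inputs would be `pitch.xs()` and `pitch.selected_array['frequency']`, and `target_times` would be the frame centers of the other analysis, for example `np.arange(n_mel_frames) * hop_length / fs` if the mel frames are centered on multiples of the hop (an assumption to check against your mel-spectrogram code).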

Here's the calculation of the window length and sample times: https://github.com/YannickJadoul/Parselmouth/blob/04b7b25244bab31b987f20956d29fea275d231b3/praat/fon/Sound_to_Pitch.cpp#L309-L397

I hope this helps. I'm afraid I don't have a silver bullet, but if you keep the sliding window approach in mind, it should clarify things and show how to proceed with caution?

If there's anything else, don't hesitate to ask!

Kind regards

Yannick

P “The Igeeit” Lam

Jul 3, 2023, 11:33:38 AM

to Parselmouth

Hi Yannick,

Thank you for the detailed response!

I was asking this in reference to https://github.com/espnet/espnet/issues/5259#issuecomment-1615945715 (trying to align with ESPnet's melspec computation).
It appears that ESPnet calls librosa.stft when computing melspec and pads the signal to n_fft, which is what I did for to_pitch.
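For reference, a small sketch of the frame layout on the STFT side. It assumes librosa.stft is called with its default center=True, in which case the signal is padded by n_fft // 2 on both sides and frame k is centered on sample k * hop_length. Whether ESPnet trims or pads further is not covered here, and the helper name is made up.

```python
def centered_stft_frames(n_samples, hop_length, fs):
    """Frame count and frame-center times for a centered STFT (assumed center=True)."""
    # With centered framing there is one frame per hop, plus one at sample 0
    n_frames = 1 + n_samples // hop_length
    # Frame k is centered on sample k * hop_length of the unpadded signal
    times = [k * hop_length / fs for k in range(n_frames)]
    return n_frames, times

n_mel_frames, mel_times = centered_stft_frames(22050, 256, 22050)
```

These times can be compared against `pitch.xs()` to see how far the two analyses are offset from each other.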

Interpolation is probably the right way to do it -- I think I will use it for prosody transfer between audio with different lengths.

Regards

Perry

yannick...@gmail.com

Jul 3, 2023, 5:48:34 PM

to Parselmouth

Hi Perry!

Good to hear. And interesting to see the application, thanks!

If it helps, I do have a pure Python (and a C++-accelerated) version of Praat's pitch tracking algorithm, which I reimplemented at some point. If you're really invested, you might be able to tweak the window length and step size such that it gives the same sampling as your mel spectrogram. Also, you wouldn't need to have the whole of Parselmouth as a dependency (although I am obviously a fan of Parselmouth, I realize it's a heavy dependency, bringing in all of Praat just for the pitch analysis).

Maybe I should at some point get round to releasing this standalone reimplementation, but then I need to somehow maintain it, etc. Yet still, if it would be helpful in the context of that ESPnet issue/PR, I'm happy to email it to you!
