Fundamental frequency calculation with librosa and parselmouth of long audio


Pia Schneider

Apr 23, 2024, 6:29:49 AM

to Parselmouth

Hello,

I'm currently analyzing the fundamental frequency of relatively lengthy audio recordings (about 7 min each), which consist of conversations between two individuals. I've employed both librosa and parselmouth for the analysis. I observe a discrepancy in the length of the resulting fundamental frequency array:

librosa:

import librosa

audio_file = 'Audio_files/VP01.wav'

y, sr = librosa.load(audio_file)

f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))

len(f0) # This yields 21496

parselmouth:

import parselmouth

snd = parselmouth.Sound("Audio_files/VP01.wav")

pitch = snd.to_pitch()

len(pitch.selected_array['frequency']) # This gives 49908

Could you direct me to the source of this difference and advise me on which method I should rely on? I'm new to audio analysis.

Moreover, I need to determine the temporal correspondence of these fundamental frequency values. Given that my recordings feature dialogues between two individuals, my objective is to filter out the segments where the second participant speaks, thereby retaining only the fundamental frequency values of the first participant. Subsequently, I aim to compute means over specific time windows, such as 20 or 30 seconds.

To achieve this, I need clarity on the "sampling frequency" (sorry for the potentially inaccurate term) of the f0 values.

Thank you for your help!

Pia

yannick...@gmail.com

Apr 23, 2024, 11:09:12 AM

to Parselmouth

Hi Pia

I'm not an expert on the different pitch tracking algorithms that exist, but librosa and Praat implement different ones, so differences in the results are to be expected.

Moreover, all these algorithms have parameters that determine how many estimates are produced. For librosa, frame_length, win_length and hop_length are probably the most relevant to the number of pitch estimates. Praat has similar ones, namely Time step (s) and Pitch floor (Hz) (available as the time_step and pitch_floor parameters in Parselmouth).
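As a rough check of where the two lengths could come from, here is a small sketch of the arithmetic. It assumes the documented defaults as I understand them (please verify them against your installed versions): librosa.load resampling to 22050 Hz, librosa.pyin using frame_length=2048 with a hop of frame_length // 4, and Praat using a time step of 0.75 / pitch floor with a pitch floor of 75 Hz. The function names are only illustrative, and edge handling at the start and end of the file is ignored.

```python
# Assumed defaults; check the documentation of your installed versions.
LIBROSA_SR = 22050            # librosa.load resamples to this rate by default
LIBROSA_FRAME_LENGTH = 2048   # librosa.pyin default frame_length
PRAAT_PITCH_FLOOR = 75.0      # Hz, default pitch floor of to_pitch()


def librosa_hop_seconds(sr=LIBROSA_SR, frame_length=LIBROSA_FRAME_LENGTH):
    """Time between pyin estimates when hop_length is left at its default."""
    hop_length = frame_length // 4
    return hop_length / sr


def praat_time_step(pitch_floor=PRAAT_PITCH_FLOOR):
    """Praat's automatic time step when time_step is not given."""
    return 0.75 / pitch_floor


# Array lengths reported in the question.
n_librosa = 21496
n_praat = 49908

# If the assumed defaults apply, both counts should imply about the same
# duration, and the count ratio should match the ratio of the step sizes.
implied_duration_librosa = n_librosa * librosa_hop_seconds()
implied_duration_praat = n_praat * praat_time_step()
count_ratio = n_praat / n_librosa
step_ratio = librosa_hop_seconds() / praat_time_step()

print(f"librosa step: {librosa_hop_seconds():.5f} s")
print(f"Praat step:   {praat_time_step():.5f} s")
print(f"implied durations: {implied_duration_librosa:.1f} s vs "
      f"{implied_duration_praat:.1f} s")
print(f"count ratio {count_ratio:.3f} vs step ratio {step_ratio:.3f}")
```

Under these assumptions the two arrays cover the same recording, just sampled at different frame rates (about 43 versus 100 estimates per second), so the length difference by itself does not say that either result is wrong.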

Both librosa and Praat mention references for the respective algorithms and document the different parameters: see https://librosa.org/doc/main/generated/librosa.pyin.html and https://www.fon.hum.uva.nl/praat/manual/Sound__To_Pitch__raw_ac____.html

I can't tell you exactly which one is right. It will probably depend on characteristics of the specific algorithms, on the precision you need, etc. Praat does tend to have good default parameters for human speech, in my experience. But I have no clue whether these defaults transfer well to the pYIN algorithm.

Regarding getting the corresponding times, yes, that should be pretty straightforward with Parselmouth. In the plotting example in the documentation (https://parselmouth.readthedocs.io/en/stable/examples/plotting.html) you can see pitch.xs() being used to plot. Alternatively, you can access the pitch.x1, pitch.nx, pitch.dx properties, or use pitch.frame_number_to_time(i) (Praat uses 1-based frame numbers!) to explicitly convert a frame number (index + 1) to a time.
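To illustrate the windowed means mentioned in the question, here is a minimal sketch in plain Python. It assumes you already have one time stamp per f0 value (for example from pitch.xs() in Parselmouth, or from librosa.times_like(f0, sr=sr, hop_length=hop_length) in librosa) and a list of (start, end) intervals in seconds where the second speaker talks. Those intervals, the toy values and the function names are hypothetical; how you obtain the speaker segments is not covered here. Unvoiced frames are skipped, assuming they are marked as NaN (librosa) or 0 (Parselmouth's selected_array['frequency']).

```python
import math


def frame_times(n_frames, t0, dt):
    """Time of each frame, given the first frame time t0 and the step dt.

    For Parselmouth this corresponds to pitch.x1 and pitch.dx.
    """
    return [t0 + i * dt for i in range(n_frames)]


def windowed_means(times, f0, window_s, exclude=()):
    """Mean f0 per window of window_s seconds.

    Skips unvoiced frames (NaN or 0) and frames whose time falls inside
    any (start, end) interval in exclude. Returns {window_start: mean};
    windows without any usable frame are left out.
    """
    sums, counts = {}, {}
    for t, f in zip(times, f0):
        if math.isnan(f) or f <= 0:
            continue  # unvoiced frame
        if any(start <= t < end for start, end in exclude):
            continue  # other speaker is talking
        k = int(t // window_s)
        sums[k] = sums.get(k, 0.0) + f
        counts[k] = counts.get(k, 0) + 1
    return {k * window_s: sums[k] / counts[k] for k in sorted(sums)}


# Toy data: 6 frames, 10 s apart (real data would have a much smaller step).
times = frame_times(6, 0.0, 10.0)
f0 = [100.0, math.nan, 200.0, 0.0, 300.0, 120.0]
other_speaker = [(40.0, 50.0)]  # hypothetical interval to drop

means = windowed_means(times, f0, window_s=20.0, exclude=other_speaker)
print(means)
```

Because the function works on explicit time stamps rather than frame indices, the same code can be applied to the output of either library, whatever step size was used.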

I hope this helps!

Kind regards

Yannick
