feature.tempo returns garbage results.

In Cognite

Mar 14, 2024, 6:41:13 AM
to librosa
Help! I have been trying for two days to get decent BPM estimation out of librosa, but I can't figure out why it gives me fluctuating BPM values no matter what I try.

When I use feature.tempo, or get it through beat.beat_track, no matter which custom options I try (playing around with the window size, hop lengths, pre-processing the signal, anything I could find in the docs), I keep getting garbage results.
One thing I noticed is that certain very specific values keep returning:

123.046875
129.19921875
178.20581896551724

It's like it just randomly returns one of these special numbers. The only thing that seems to bias which of them gets returned is the start_bpm argument.

The input array I use is the last x seconds of a buffered audio device stream, so semi-realtime, and it gets called once every second. I'm using 4/4 electronic tracks with a clear and strong beat; I even tried running Ableton with a click track, or just playing a kick on every beat. The array is simply a list of float values between -1 and +1 converted to a numpy array.

This is driving me insane -_-

Brian McFee

Mar 14, 2024, 6:48:37 AM
to librosa
This is difficult to diagnose without seeing your code, but one question that comes to mind is how long of a buffer are you using here?  The tempo estimator works by computing an onset strength envelope, correlating it against itself, and looking for a peak in the neighborhood of where we expect the tempo to be.  If your buffer is too short, the autocorrelation will be unreliable and you might get spurious results as described.  If you're only analyzing 1 second of audio at a time, that's quite likely the problem.
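For reference, here is a minimal sketch of calling the tempo estimator directly on a buffer; the synthetic click signal is just a stand-in for a live capture:

```
import numpy as np
import librosa

sr = 22050
# Stand-in for a ~10 second live buffer: a click on every beat at 180 BPM
times = np.arange(0, 10, 60 / 180)
y = librosa.clicks(times=times, sr=sr, length=10 * sr)
bpm = librosa.feature.tempo(y=y, sr=sr)[0]
print(bpm)  # inspect what comes back for a known tempo
```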

Another, more subtle problem that can occur when doing stream-based analysis is that you can introduce discontinuities to the signal when concatenating non-overlapping buffers.  These discontinuities can look like transients (and therefore, onsets) to the subsequent analysis, and also corrupt your results.  Take a look at the librosa.stream documentation to understand how we handle this when spooling large files from disk.  It won't be applicable in your case, since you're operating on a live capture, but it should at least give you a sense of the kinds of things to look out for.
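A minimal sketch of librosa.stream usage for comparison (the path is hypothetical; note the center=False, which keeps frames left-aligned so block boundaries stay consistent):

```
import librosa

path = 'some_file.wav'  # hypothetical file, for illustration only
sr = librosa.get_samplerate(path)
stream = librosa.stream(path, block_length=128, frame_length=2048, hop_length=512)
for y_block in stream:
    # analyze each block with left-aligned frames
    oenv = librosa.onset.onset_strength(y=y_block, sr=sr, hop_length=512, center=False)
```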

In Cognite

Mar 14, 2024, 10:19:17 AM
to librosa
I'm running the code in TouchDesigner (the final goal is to send out BPM and other analysis information over OSC to other TouchDesigner instances, to sync them with live music).
So the stream from the audio hardware goes into a rolling buffer. TouchDesigner provides the buffer data channel as a numpy array of floats, so it's guaranteed to be consistent.

I tried buffer sizes anywhere between a couple of seconds and 10 seconds (10 seconds should provide enough kick onsets to calculate a BPM for most EDM music).
For hop_length I tried 512, 1024, 2048, 256, 128, 64, etc.
I tried the audio at 44100 Hz as well as 22050 Hz.
For start_bpm I tried everything from well below the ground truth (the known BPM of a track) to just below it. librosa also does not respect start_bpm and will often return lower values.

The code is very simple:

```
foundbpm, _ = librosa.beat.beat_track(y=y, sr=sr, hop_length=hop_length, start_bpm=mintempo)
```

Alternatively, I tried for better results with:

```
oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
foundbpm = librosa.feature.tempo(y=y, sr=sr, start_bpm=mintempo)
```

Then I experimented with all the arguments those functions can take, and tried pre-filtering and denoising the audio.

Brian McFee

Mar 14, 2024, 10:26:47 AM
to librosa
Ok, 10 seconds ought to be enough for any reasonable tempo.

You're definitely better off calling the tempo estimator directly rather than going through the beat tracker.  (It mostly won't matter, but the beat tracker incurs a lot of overhead that you don't need here.)  Your code example has a couple of problems though: if you're going to pre-compute the onset envelope, you would have to pass that as input to the tempo estimator, rather than the signal again.  Additionally, if you're changing the hop length, that needs to be passed into the tempo estimator as well.  For example:

```
oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
bpm = librosa.feature.tempo(onset_envelope=oenv, sr=sr, hop_length=hop_length, start_bpm=mintempo)
```

Regarding "start_bpm" - this parameter is used to inform the tempo estimator about what the tempo should be.  The actual detected tempo may be either faster or slower than start_bpm, and that's totally fine.  In the context of a rolling buffer, once you shake out the rest of the bugs in your implementation, you could even use the previous detection as the start_bpm for the next analysis, and that should give you some continuity in your estimates.

In Cognite

Mar 14, 2024, 1:31:19 PM
to librosa
Oh, my copypasta mistake: I did use oenv in tempo originally, as well as in beat_track. I also tried using plp, like so:


```
oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
plp = librosa.beat.plp(sr=sr, onset_envelope=oenv, hop_length=hop_length, win_length=win_length, tempo_min=mintempo)
foundbpm = librosa.feature.tempo(sr=sr, onset_envelope=plp, start_bpm=mintempo, max_tempo=mintempo+100)
```

as well as skipping onset_strength and providing y to plp directly.

To illustrate, I have used librosa.clicks to compare the detected features with the source signal.
It seems I can't post screenshots of the resulting plots here, so I will try in words:
When processing the rolling buffer a few times at steady intervals (the last 10 seconds, every 1 second), I notice that the length of oenv, plp, or beats varies. (I assume this is normal, as the length is just the number of features found and the values are their positions.)
BUT when I use them in beat_track and feed the beats to librosa.clicks, I never get a click track back anywhere near 10 seconds long, and each time the length and the number of clicks are different, but never anywhere near the number of visible kick-drum peaks in the signal. The clicks are spaced evenly apart, which I think is as expected, but sometimes it will show a series of evenly spaced clicks with some random occurrences missing (no click where one would be expected), even when the input clearly shows a steady beat with clearly visible strong peaks on the beat.
 

MangoTherapy

Mar 14, 2024, 1:31:26 PM
to librosa

I also wrote about this problem earlier: the BPM comes out incorrect, and it substitutes template values such as 123.046875.

I have a database of 3000 tracks, and the pattern of repeated values is obvious there.
On Thursday, March 14, 2024 at 5:26:47 PM UTC+3, brian...@nyu.edu wrote:

Brian McFee

Mar 14, 2024, 1:41:40 PM
to librosa
On Thursday, March 14, 2024 at 1:31:19 PM UTC-4 kiyos...@gmail.com wrote:
Oh, my copypasta mistake: I did use oenv in tempo originally, as well as in beat_track. I also tried using plp, like so:

```
oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
plp = librosa.beat.plp(sr=sr, onset_envelope=oenv, hop_length=hop_length, win_length=win_length, tempo_min=mintempo)
foundbpm = librosa.feature.tempo(sr=sr, onset_envelope=plp, start_bpm=mintempo, max_tempo=mintempo+100)
```

as well as skipping onset_strength and providing y to plp directly.


I wouldn't recommend using PLP to estimate tempo.  The PLP method discards quite a bit of information up front, as it implicitly estimates a dominant tempo at each frame, so any mistakes early on will propagate through to the resulting analysis.  You're almost certainly better off with tempo derived directly from onset strength.
 
To illustrate, I have used librosa.clicks to compare the detected features with the source signal.
It seems I can't post screenshots of the resulting plots here, so I will try in words:
When processing the rolling buffer a few times at steady intervals (the last 10 seconds, every 1 second), I notice that the length of oenv, plp, or beats varies. (I assume this is normal, as the length is just the number of features found and the values are their positions.)

This sounds incorrect, and you may be misinterpreting what's going on here.  If the duration of the input buffer is fixed (10 seconds at whatever sampling rate you choose), then the length of oenv will be fixed as well.

The "amount of features" here is 1: you have a single value at each frame index corresponding to the strength of note onsets at that frame index.  The values in the oenv array are not positions, they are strength measurements.

The example illustrated in https://librosa.org/doc/latest/generated/librosa.onset.onset_detect.html should give you a picture of what's going on here (see the figure on that page):

Onset strength is the blue curve, the spectrogram is displayed above, and the horizontal axis is time (converted from frame indices).

If the length of your onset envelope is changing, this says to me that something is going wrong in your buffer concatenation.
 
BUT when I use them in beat_track and feed the beats to librosa.clicks, I never get a click track back anywhere near 10 seconds long, and each time the length and the number of clicks are different, but never anywhere near the number of visible kick-drum peaks in the signal.

Again, this suggests misinterpretation of the data.  Make sure that the beat times you're sonifying with the clicks function are in the proper units.  Both the beat tracker and clicks function can index by sample, time (seconds), or frame index, but you need to make sure that this is implemented consistently.
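For example, one consistent way to do it is to work in seconds end-to-end (a sketch, assuming y, sr, and hop_length are defined):

```
import librosa

# beat_track returns frame indices by default
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr, hop_length=hop_length)
# convert frames -> seconds with the SAME sr and hop_length
beat_times = librosa.frames_to_time(beat_frames, sr=sr, hop_length=hop_length)
# length=len(y) guarantees the click track matches the input duration
click_track = librosa.clicks(times=beat_times, sr=sr, length=len(y))
```

(beat_track also accepts units='time' if you'd rather get seconds directly.)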

The example that merged this morning for variable-tempo tracking should give you a starting point for click sonification: https://librosa.org/doc/main/auto_examples/plot_dynamic_beat.html#sphx-glr-auto-examples-plot-dynamic-beat-py

Brian McFee

Mar 14, 2024, 1:44:56 PM
to librosa
On Thursday, March 14, 2024 at 1:31:26 PM UTC-4 MangoTherapy wrote:

I also wrote about this problem earlier: the BPM comes out incorrect, and it substitutes template values such as 123.046875.


Without seeing the data (and code) that causes this, there's no way to know what's going on, but it's very unlikely that the tempo estimator would spit out "template values" as you describe.  I suggest digging into the onset envelope calculation to see what the tempo estimator is working with.  The code examples in https://librosa.org/doc/main/generated/librosa.feature.tempogram.html may be of some help with that.

In Cognite

Mar 14, 2024, 2:47:12 PM
to librosa
[attachment: oenv problem.jpg]
Found the image upload button :)

So in my debugging, I can see that the length of y remains fixed; I am 100% sure that CHOP channels in TouchDesigner don't suddenly change their number of samples when you take them as a numpy array in Python.
For the capture above, the signal (y) is exactly the same as the input signal. 
oenv here is generated by:

```
oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
```
As you can see, the resulting oenv array has 26 samples in this case; the exact numbers vary, but it's always a small number.
(When I feed it pure silence, the length becomes 0 and Python raises an error, because an array cannot have length 0.)

Then I pass it to beat_track so I can create the click track from the generated beats array:

```
foundbpm, beats = librosa.beat.beat_track(onset_envelope=oenv, sr=sr, hop_length=hop_length, start_bpm=mintempo)
clicks = librosa.clicks(frames=beats, sr=sr)
```

I also would not recommend using plp, but I did use it because the standard approaches do not produce the expected results, and I am desperately trying everything I can find in the docs in my quest to find out why it's not working.

Brian McFee

Mar 14, 2024, 2:54:56 PM
to librosa
On Thursday, March 14, 2024 at 2:47:12 PM UTC-4 kiyos...@gmail.com wrote:
[attachment: oenv problem.jpg]
Found the image upload button :)

So in my debugging, I can see that the length of y remains fixed; I am 100% sure that CHOP channels in TouchDesigner don't suddenly change their number of samples when you take them as a numpy array in Python.
For the capture above, the signal (y) is exactly the same as the input signal. 

Two things here: first, please verify that the sample values in y actually are what you expect them to be - plot them from the Python side, not in TouchDesigner.  Or at least save out the buffer and load it back up with another tool to make sure that nothing is going wrong at the signal acquisition point; otherwise nothing afterward will make sense.
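A sketch of that round-trip check, assuming the soundfile package is available:

```
import numpy as np
import soundfile as sf

# dump the live buffer to disk...
sf.write('buffer_check.wav', np.asarray(y, dtype=np.float32), sr)
# ...and read it back independently to verify nothing was mangled
y_check, sr_check = sf.read('buffer_check.wav', dtype='float32')
assert sr_check == sr and len(y_check) == len(y)
```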

Second, what is the sampling rate of your signal?  If that's 10 seconds of audio in those plots, and the numbers on the plot axes correspond to samples, that would mean a sampling rate of something like 1300, which is way too low for the kind of analysis you're doing.

In Cognite

Mar 14, 2024, 4:13:11 PM
to librosa
First of all, thank you for taking the time to help out a stranger with this problem.

Python in TouchDesigner is an integrated environment; I am just saving out the same numpy array that I pass to onset_strength.
Don't get blinded by the numerical details of the example; as I stated before, I have been experimenting a lot, and this is just a random screenshot of one of hundreds of tries.
I guess in this one I might have tried to downsample the input by a factor of 20. In any case (and I can't repeat this enough) it's NOT a single 'special' situation dependent on a specific input state.

Meanwhile, I have replaced the librosa function in my code.
I can get good BPMs from the same input when I apply a low-pass filter, then capture the instantaneous signal power envelope, downsample to 400 Hz, and then use numpy's correlate and scipy's find_peaks.
When I map the resulting peak position values onto an array with the same length as the input signal, I can verify the pulses line up with the original waveform.
Sadly, the accuracy of my amateur solution suffers from the extreme downsampling (and without it numpy.correlate becomes too slow for realtime), and it only works on kick-drum-focused tracks where there's not too much movement in the bass and sub frequencies from a funky bassline or such.
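A rough sketch of that pipeline (the filter order, cutoff frequency, and envelope details here are illustrative guesses, not my exact code):

```
import numpy as np
from scipy.signal import butter, sosfilt, find_peaks

sos = butter(4, 150, btype='low', fs=sr, output='sos')   # low-pass around the kick
env = np.abs(sosfilt(sos, y))                            # crude amplitude/power envelope
env = env[::sr // 400]                                   # downsample envelope to ~400 Hz
ac = np.correlate(env, env, mode='full')[len(env) - 1:]  # autocorrelation (O(n^2), hence slow)
peaks, _ = find_peaks(ac)
if len(peaks) > 0:
    bpm = 60.0 * 400 / peaks[0]                          # lag of the first peak -> BPM
```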

At this point I'm ready to give up on trying to use librosa. It's a very cool idea, and the BPM function seemed very easy to use in theory (insert signal, insert sample rate, get correct BPM).
All I can think of to continue troubleshooting is to export buffers to wav files, adapt the code to run outside of TD, and then learn matplotlib and more Python to write a testing suite there... that's just an insane amount of work.
So all I can do is give the specifications of what I have tried, the code I have tried, and my findings, which I hope can help you in further development.

Brian McFee

Mar 14, 2024, 4:21:56 PM
to librosa
On Thursday, March 14, 2024 at 4:13:11 PM UTC-4 kiyos...@gmail.com wrote:
First of all, thank you for taking the time to help out a stranger with this problem.


Happy to (try to) help!  Obviously, something in the documentation is not clear enough, and I'd like to get to the bottom of this so I can determine where to improve things.
 
Python in TouchDesigner is an integrated environment; I am just saving out the same numpy array that I pass to onset_strength.
Don't get blinded by the numerical details of the example; as I stated before, I have been experimenting a lot, and this is just a random screenshot of one of hundreds of tries.
I guess in this one I might have tried to downsample the input by a factor of 20. In any case (and I can't repeat this enough) it's NOT a single 'special' situation dependent on a specific input state.

Sure, I'm not suggesting that you're hitting some weird special case.  But "numerical details" are what it's all about here - if the example you provide is not representative of what you're actually doing, and we don't have a complete piece of code to look at or the audio you're analyzing, it's nearly impossible to tell you where it's going wrong.

I'm quite confident that the code (i.e., the librosa tempo estimator) is not fundamentally broken, and the most likely explanation is either something going wrong in the signal acquisition or a misinterpretation of the data going into / coming out of librosa.

I'd appreciate it if you could provide a concise code example (with all relevant parameters specified; sampling rate, hop length, etc) and a screenshot of the data that you get from it.
 

Meanwhile, I have replaced the librosa function in my code.
I can get good BPMs from the same input when I apply a low-pass filter, then capture the instantaneous signal power envelope, downsample to 400 Hz, and then use numpy's correlate and scipy's find_peaks.
When I map the resulting peak position values onto an array with the same length as the input signal, I can verify the pulses line up with the original waveform.
Sadly, the accuracy of my amateur solution suffers from the extreme downsampling (and without it numpy.correlate becomes too slow for realtime), and it only works on kick-drum-focused tracks where there's not too much movement in the bass and sub frequencies from a funky bassline or such.

At this point I'm ready to give up on trying to use librosa. It's a very cool idea, and the BPM function seemed very easy to use in theory (insert signal, insert sample rate, get correct BPM).
All I can think of to continue troubleshooting is to export buffers to wav files, adapt the code to run outside of TD, and then learn matplotlib and more Python to write a testing suite there... that's just an insane amount of work.

It might be less work than you think.  There's tons of example code in the librosa docs that you should be able to copy-paste with minimal modifications and get some diagnostic outputs.  I think this is worth trying, if only because simplifying your environment is often the first step toward identifying the problem.

In Cognite

Mar 15, 2024, 10:31:54 AM
to librosa
Ok, one more try. I have attached two 8-second audio clips, both from the same track.
For each case, I ran the code you can see in the images, running feature.tempo with the same input and a range of hop lengths.
None of the found tempos are near the ground truth of 180.0 BPM.
[attachments: clip_8s.wav, clip2_8s.wav, clip_8s_spectogram.jpg, clip2_8s_spectogram.jpg, librosa_hops_test.jpg, librosa_hops_test2.jpg]

Brian McFee

Mar 15, 2024, 10:53:41 AM
to librosa
Ok, I think this is actually working as expected, and it's just a question of the default parameters for librosa not being optimal for this style of music.  The following is using sr=44100 and hop=512.

If you plot the tempogram over time for clip2, it looks as follows:
[image: tempogram over time for clip2]
(red = high energy at said tempo, blue = low energy)

If you aggregate that over time, it looks as follows:

[image: aggregated tempo curve (tempo_curve.png)]
The highest peak is indeed around 180 (180.129).  Note that there are also strong peaks at rational fractions of the tempo: 120 (2/3), 90 (1/2), etc.; this comes from the strongly regular pulse of the track.
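That analysis can be reproduced roughly like this (a sketch using the attached clip and the stated parameters):

```
import numpy as np
import librosa

y, sr = librosa.load('clip2_8s.wav', sr=44100)
oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=512)
tg = librosa.feature.tempogram(onset_envelope=oenv, sr=sr, hop_length=512)
tempo_curve = tg.mean(axis=1)  # aggregate over time
bpms = librosa.tempo_frequencies(tg.shape[0], sr=sr, hop_length=512)
print(bpms[np.argmax(tempo_curve[1:]) + 1])  # skip the lag-0 bin (infinite BPM)
```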

What's going on here is that by default, librosa does not use the raw tempo curve, but applies a weighting to it that biases toward an expected tempo of 120 (or whatever start_bpm is).  The following figure shows the tempo curve, the weighting, and the result of multiplying the two:

[image: tempo curve, weighting curve, and their product (weighted_curve.png)]
The green dashed curve is what's being used for tempo estimates; as you can see, it's suppressing the peak at 180 in favor of the peak at 120, which is closer to what you would expect to see (statistically) in most popular music.

This behavior is easy enough to change in a variety of ways.  The simplest is to just make the std_bpm parameter larger; the default value is 1, but larger values will spread the weighting curve out more:
[image: weighted tempo curve with std_bpm=4 (weighted_curve_std4.png)]
This was enough to preserve the peak at 180 for this clip, and ought to work in general for you.  Remember that the default parameters in librosa are meant to work well "on average", but they might need a little tweaking to work optimally in specific situations.
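Concretely, a sketch of that change (continuing from the snippet above, where oenv was already computed):

```
import librosa

bpm = librosa.feature.tempo(onset_envelope=oenv, sr=sr, hop_length=512,
                            start_bpm=120, std_bpm=4.0)[0]  # default std_bpm is 1.0
```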

Now, to go back to the original post, the numbers it was selecting for tempo were almost certainly not random "special numbers", but rational subdivisions of the true tempo of your piece that just happened to look statistically more probable under the default parameterization.

In Cognite

Mar 18, 2024, 3:42:56 PM
to librosa
I'm still confused about the tempogram.

When I run this line with different hop lengths:

```
oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
```

The output is:
```
------------------------
hop length: 512
signal shape: (13230,)
oenv shape: (26,)
------------------------
hop length: 256
signal shape: (13230,)
oenv shape: (52,)
------------------------
hop length: 128
signal shape: (13230,)
oenv shape: (104,)
------------------------
hop length: 64
signal shape: (13230,)
oenv shape: (207,)
```


Clearly, the oenv shape is not one sample per signal sample, and it changes according to the hop length. I want to plot the tempogram, but I'm not sure how the oenv data should be interpreted and plotted.

Brian McFee

Mar 18, 2024, 3:45:15 PM
to librosa
That's right - if you reduce the hop length (from 512 to 64), then your frame rate increases, so you'll have more frames for the same number of samples.
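For what it's worth, the counts you posted are consistent with the usual centered-framing formula (an observation based on your numbers, not a documented guarantee):

```
n_frames = 1 + len(y) // hop_length  # e.g. 1 + 13230 // 512 == 26
```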

Refer to https://brianmcfee.net/dstbook-site/content/ch09-stft/intro.html for a detailed explanation of how this kind of analysis works in the context of short-time fourier transforms.

In Cognite

Mar 19, 2024, 10:13:34 AM
to librosa
Ok, I see where I was confused in the terminology. So the number of samples in oenv is equal to the number of windows analyzed by the mel spectrogram function, and the response for each hop is the result of retrieving this feature value for each window.
I guess if I change the hop length, it would make sense to dynamically find an optimal value for win_length as well? Perhaps depending on the target BPM range. I can see that win_length defaults to the n_fft size, which does not seem to default to anything, yet the function works without providing a value for it, so I'm not sure from which value to start experimenting.

But I still don't get how the oenv relates to the STFT. My understanding of the science side of signal analysis is very rudimentary, as I am approaching this from a creative coder/musician perspective, so please bear with me.
As I understand it so far:
> The tempogram is basically like the spectrogram we have in DAWs and media players, with a vertical resolution equal to the number of FFT bins and a horizontal resolution of 1 sample per processed window.
> But instead of getting high values in the FFT bins for excitement in the corresponding pitch frequencies, it uses much larger signal windows to get results in the rhythmic frequency range.
> The onset envelope is not the same as the tempogram; it does not have a vertical resolution.
 
Question 1: Then how did you plot a tempogram from the onset envelope like that?
Question 2: When you say 'aggregate that over time', that sounds to me like collapsing the vertical dimension by summing the values of each column, which would return an onset envelope again (as each horizontal row in the tempogram is basically an onset envelope of a single frequency bin). But instead you get this 'tempo curve', and I can't find anything about that in the docs.


Brian McFee

Mar 19, 2024, 10:33:09 AM
to librosa
On Tuesday, March 19, 2024 at 10:13:34 AM UTC-4 kiyos...@gmail.com wrote:
Ok, I see where I was confused in the terminology. So the number of samples in oenv is equal to the number of windows analyzed by the mel spectrogram function, and the response for each hop is the result of retrieving this feature value for each window.

Yes, that's correct.
 
I guess if I change the hop length, it would make sense to dynamically find an optimal value for win_length as well? Perhaps depending on the target BPM range.

You can change the frame length if you want, but I don't think it will buy you much, especially in this style of music.

 
I can see that win_length defaults to the n_fft size, which does not seem to default to anything, yet the function works without providing a value for it, so I'm not sure from which value to start experimenting.

Yeah, this is a somewhat finicky point in the documentation.  We generally try to be explicit about what parameters are passed through functions (via variable keyword arguments, kwargs), but in the specific case of onset strength, it's difficult because the underlying function call can be user-specified.  The default is melspectrogram, so any additional arguments that melspectrogram takes can be passed in through onset_strength.
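For example (a sketch; n_fft and n_mels here are melspectrogram parameters being forwarded, with y, sr, and hop_length assumed defined):

```
import librosa

oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length,
                                    n_fft=2048, n_mels=128)  # passed through to melspectrogram
```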

 

But I still don't get how the oenv relates to the STFT.

The very short summary is as follows (a rough numpy sketch follows the list):
  • Compute a magnitude spectrogram, call it S[frequency, time] (where frequency may be measured in mel bins, and time in frame indices); this tells you how much energy is in a frequency at a given time.
  • For each frequency, compute the difference from one time step to the next and threshold at 0: D[frequency, time] = max(0, S[frequency, time] - S[frequency, time-1]); this tells you whether energy in a frequency band is increasing at a given time.
  • For each time, sum over the frequency range to get the onset envelope: oenv[time] = sum_frequency D[frequency, time]; this combines the contributions at all frequencies.
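In numpy terms, roughly (a hedged sketch; librosa's actual implementation differs in details such as dB scaling, reference lag, and mean vs. sum aggregation):

```
import numpy as np
import librosa

S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length))
D = np.maximum(0.0, S[:, 1:] - S[:, :-1])  # thresholded difference per frequency band
oenv_manual = D.sum(axis=0)                # combine contributions across frequencies
```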
 
My understanding of the science side of signal analysis is very rudimentary, as I am approaching this from a creative coder/musician perspective, so please bear with me.
As I understand it so far:
> The tempogram is basically like the spectrogram we have in DAWs and media players, with a vertical resolution equal to the number of FFT bins and a horizontal resolution of 1 sample per processed window.

Not quite.  A tempogram is a spectrogram-like representation derived from the onset envelope instead of the raw signal.  This can be done either using Fourier analysis (see: fourier_tempogram) or autocorrelation analysis (tempogram); in the latter case, we think in terms of "lag" (period) rather than frequency.
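Both variants are available directly (a sketch, reusing oenv, sr, and hop_length from earlier):

```
import librosa

tg_ac = librosa.feature.tempogram(onset_envelope=oenv, sr=sr, hop_length=hop_length)          # autocorrelation (lag)
tg_ft = librosa.feature.fourier_tempogram(onset_envelope=oenv, sr=sr, hop_length=hop_length)  # Fourier
```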
 
> But instead of getting high values in the FFT bins for excitement in the corresponding pitch frequencies, it uses much larger signal windows to get results in the rhythmic frequency range.

Basically, yes.  Remember that the frame rate of the onset envelope is already much smaller than that of the original signal (divide by hop length).
 
> The onset envelope is not the same as the tempogram; it does not have a vertical resolution.

Correct.
 
 
Question 1: Then how did you plot a tempogram from the onset envelope like that?

librosa.feature.tempogram :)
 
Question 2: When you say 'aggregate that over time', that sounds to me like collapsing the vertical dimension by summing the values of each column, which would return an onset envelope again (as each horizontal row in the tempogram is basically an onset envelope of a single frequency bin). But instead you get this 'tempo curve', and I can't find anything about that in the docs.

No, it's collapsing the horizontal (time) dimension.  The idea is to measure the energy at each tempo at every time step. If we want to know about energy at a tempo across all time, then we eliminate the time variable.

Regarding the "tempo curve", see the examples in https://librosa.org/doc/latest/generated/librosa.feature.tempo.html