Tempo Estimation and Beat tracking with Bidirectional Neural Networks


anjle...@hotmail.co.uk

Oct 16, 2019, 9:22:27 AM
to madmom-users
Hi,

I am new to AI, so excuse me if any of my questions are quite basic. I am very interested in the uses and benefits of AI in Music Information Retrieval, and I have found these papers very interesting and informative. I have been following them to conduct experiments for a project I am working on:

http://mir.minimoog.org/sb-diploma-thesis


I believe I have managed to reproduce some of this work, which is referenced on the madmom website.

I have created a bidirectional LSTM using Keras and the SMC MIREX data set, but I have been having a few issues with my network, which may be due to my feature extraction or my post-processing with the madmom package. I was wondering if you could help by answering a few questions about the articles:
  1. During feature extraction we create a feature vector from six different spectrograms, but the first article also mentions filter banks. Is this related to the MFCCs described here: https://musicinformationretrieval.com/mfcc.html?
  2. For the beat target vectors I am reading the annotations from the data set, converting the timestamps to frames, and creating 0/1 target vectors with the same length as the number of frames in each audio file. Is this the correct method, and am I right to use binary cross-entropy with a sigmoid dense layer on my network?
  3. My network's accuracy seems good: 0.9860 after just 25 epochs. However, my loss seems quite high, starting at 14.7 and decreasing to 8.5, and by the looks of things there is no over- or underfitting compared to the validation set. Is this normal for the network?
  4. I have attempted to use madmom 0.16.1 to compute a tempo estimation, but I keep receiving the same tempo for every sound file when using the histogram-based method.
  5. I have also attempted to follow the first paper and use the autocorrelation method, specifically section 3.3. However, the threshold function madmom.features.beats.threshold_activations(activations, threshold) does not seem to exist in the package, and whenever I pass my output vector to NumPy's or librosa's autocorrelation function I get a straight line.
I am not sure how to replicate the results from the articles, or how to use the tempo estimation functions with my own neural network. Could you please point me in the right direction?

Thank you,

Anjlee

Sebastian Böck

Oct 16, 2019, 10:13:04 AM
to madmom-users
Hi,

On Wednesday, 16 October 2019 15:22:27 UTC+2, anjle...@hotmail.co.uk wrote:
I have created a bidirectional LSTM using Keras and the SMC MIREX data set, but I have been having a few issues with my network, which may be due to my feature extraction or my post-processing with the madmom package. I was wondering if you could help by answering a few questions about the articles:
  1. During feature extraction we create a feature vector from six different spectrograms, but the first article also mentions filter banks. Is this related to the MFCCs described here: https://musicinformationretrieval.com/mfcc.html?
Yes and no. MFCCs are Mel-frequency cepstral coefficients, so first a Mel-filtered spectrogram is computed. This is also what I used back then as input to the neural network. No cepstral coefficients (i.e. an additional DCT applied to the spectrogram) are used, though.
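
In librosa terms (which you seem to be using) the difference looks roughly like this; just a sketch with default parameters, not the exact features from the thesis:

import numpy as np
import librosa

y, sr = librosa.load('some_file.wav')  # any audio file
# Mel-filtered spectrogram with logarithmic scaling; this kind of
# representation is what the network gets as input
mel = librosa.feature.melspectrogram(y=y, sr=sr)
log_mel = np.log10(mel + 1.0)
# MFCCs additionally apply a DCT to the log-scaled Mel spectrogram
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel))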
 
  2. For the beat target vectors I am reading the annotations from the data set, converting the timestamps to frames, and creating 0/1 target vectors with the same length as the number of frames in each audio file. Is this the correct method, and am I right to use binary cross-entropy with a sigmoid dense layer on my network?
Exactly.
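Something like this should do; a minimal sketch, assuming 100 fps and annotation files with one beat timestamp (in seconds) per line:

import numpy as np

def beat_targets(annotation_file, num_frames, fps=100):
    # read the beat annotation timestamps (first column)
    beat_times = np.loadtxt(annotation_file, usecols=0, ndmin=1)
    # quantise each beat time to the nearest frame index
    beat_frames = np.round(beat_times * fps).astype(int)
    # frame-wise 0/1 target vector of the same length as the features
    targets = np.zeros(num_frames, dtype=np.float32)
    targets[beat_frames[beat_frames < num_frames]] = 1
    return targets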
  3. My network's accuracy seems good: 0.9860 after just 25 epochs. However, my loss seems quite high, starting at 14.7 and decreasing to 8.5, and by the looks of things there is no over- or underfitting compared to the validation set. Is this normal for the network?
Yes, this is to be expected. However, the accuracy values are basically meaningless, because the only thing they tell you is that you classify 98.6% of all frames correctly. I guess that the remaining 1.4% are mostly the frames with a target value of 1. So you should either compute the accuracy only for these frames or ignore it altogether. In my experience it is enough to monitor the cross-entropy loss.
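
E.g. something like this; a sketch, with `predictions` and `targets` being frame-wise arrays of equal length:

import numpy as np

def beat_frame_accuracy(predictions, targets, threshold=0.5):
    # accuracy restricted to the frames with a target of 1,
    # i.e. the recall of the beat frames
    beat_frames = targets == 1
    return np.mean(predictions[beat_frames] > threshold)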
  4. I have attempted to use madmom 0.16.1 to compute a tempo estimation, but I keep receiving the same tempo for every sound file when using the histogram-based method.
This is a bit hard to believe...
  5. I have also attempted to follow the first paper and use the autocorrelation method, specifically section 3.3. However, the threshold function madmom.features.beats.threshold_activations(activations, threshold) does not seem to exist in the package, and whenever I pass my output vector to NumPy's or librosa's autocorrelation function I get a straight line.
No, there is no function with that name. But why do you think there should be? Although some algorithms included in madmom are based on these early works, almost everything has changed in the meantime.

However, the development version of madmom has an `interval_histogram_acf()` function which basically does what you're trying to accomplish: computing a tempo histogram from a beat activation function.

I am not sure how to replicate the results from the articles, or how to use the tempo estimation functions with my own neural network. Could you please point me in the right direction?

I'd try to replicate newer works, since the older they are, the more difficult it is to get the exact same data.

However, the biggest problem might be that keras (or tensorflow) has an incomplete LSTM implementation which does not have peephole connections, and these seem to be quite important for accurate timing of events. I once tried to reproduce my own beat tracking results and was not able to do so. But I gave up on this quite quickly, because keras/TF was not only performing worse, but was also ~25x slower than RNNLIB using only a single CPU core.

Just a heads-up: ISMIR is taking place in roughly two weeks, and I will release the code and data for our newest multi-task beat tracking and tempo estimation system. It has basically the same beat tracking performance as the BLSTM approaches, but is much faster to train.

HTH

anjle...@hotmail.co.uk

Oct 17, 2019, 5:12:54 AM
to madmom-users
For starters, thank you for the quick reply! 

This is where I found the function I mentioned; maybe you should remove it from the docs?


Thank you for the advice about the peephole connections. TF 2.0 seems to have an experimental peephole LSTM cell, which I will try on my model, but I will keep that in mind. What framework would you suggest instead?
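
Something along these lines is what I had in mind (assuming the experimental cell can be wrapped like a normal recurrent layer):

import tensorflow as tf

# a bidirectional recurrent layer built from the experimental peephole LSTM cells
blstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.RNN(tf.keras.experimental.PeepholeLSTMCell(25),
                        return_sequences=True))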

That's great, I look forward to seeing your new system! Will you be posting the code online as well, or an article?

This is what I am doing to get the tempos from my model. Is this correct?

import numpy as np
import librosa
import madmom.features.tempo as mdmtempo
from tensorflow.keras.models import load_model

# load the trained model once, outside the loop
model = load_model('musicbeatsmod_v3.h5')

tempos = []
predictions = []
for f in myfiles:
    y, sr = librosa.load(f)
    X, frame = myfeat(y, sr)            # my own feature extraction
    X = np.array(X)[np.newaxis, :, :]   # add a batch dimension
    Y = model.predict(X)                # frame-wise beat activations
    predictions.append(Y)
    # tempo histogram from the autocorrelation of the activations
    histogram = mdmtempo.interval_histogram_acf(np.squeeze(Y), min_tau=20, max_tau=None)
    tempo = mdmtempo.detect_tempo(histogram, fps=100)
    tempos.append(tempo)

Sebastian Böck

Oct 17, 2019, 5:24:49 AM
to madmom-users
Hi,


On Thursday, 17 October 2019 11:12:54 UTC+2, anjle...@hotmail.co.uk wrote:
For starters, thank you for the quick reply! 

This is where I found the function I mentioned; maybe you should remove it from the docs?


I am really sorry, but of course you are right. However, this function does not have the functionality you anticipated; it does what the docstring says: it returns only the main segment with activations higher than the given value.
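
Usage is roughly the following (if I remember correctly it returns the trimmed segment together with its start index; please check the docstring):

from madmom.features.beats import threshold_activations

# keep only the main segment of the activation function `act`
# (a 1D numpy array) that exceeds the threshold
segment, start = threshold_activations(act, 0.1)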
 
Thank you for the advice about the peephole connections. TF 2.0 seems to have an experimental peephole LSTM cell, which I will try on my model, but I will keep that in mind. What framework would you suggest instead?

I trained all my RNNs with RNNLIB.
 
That's great, I look forward to seeing your new system! Will you be posting the code online as well, or an article?

Yes, it will be added to madmom quite soon.
 
This is what I am doing to get the tempos from my model. Is this correct?

import numpy as np
import librosa
import madmom.features.tempo as mdmtempo
from tensorflow.keras.models import load_model

# load the trained model once, outside the loop
model = load_model('musicbeatsmod_v3.h5')

tempos = []
predictions = []
for f in myfiles:
    y, sr = librosa.load(f)
    X, frame = myfeat(y, sr)            # my own feature extraction
    X = np.array(X)[np.newaxis, :, :]   # add a batch dimension
    Y = model.predict(X)                # frame-wise beat activations
    predictions.append(Y)
    # tempo histogram from the autocorrelation of the activations
    histogram = mdmtempo.interval_histogram_acf(np.squeeze(Y), min_tau=20, max_tau=None)
    tempo = mdmtempo.detect_tempo(histogram, fps=100)
    tempos.append(tempo)

It is hard to say whether the above code works or not. What I suspect is that your network's predictions are always the same, and thus the tempo predictions are as well. So the question is: what does the beat activation function look like, and what tempo is detected then? You could compare your prediction for a certain file with what madmom predicts. Simply create an instance of `madmom.features.beats.RNNBeatProcessor` and call it with the same filename to obtain a beat activation function (an example is shown in the docs: https://madmom.readthedocs.io/en/latest/modules/features/beats.html#madmom.features.beats.RNNBeatProcessor)
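
I.e. something like this, with 'some_file.wav' being one of your files:

from madmom.features.beats import RNNBeatProcessor

# frame-wise beat activation function (100 fps)
proc = RNNBeatProcessor()
act = proc('some_file.wav')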

HTH

anjle...@hotmail.co.uk

Oct 18, 2019, 6:09:54 AM
to madmom-users


Hi again,

Thank you for all the help! You were right about the peepholes, of course, and after adding them I'm getting a much better result from my activation function.

I have realised that my main issue is that my model predicts beats only at the start and end of each sequence. That's why the same tempos keep coming out: even though the probabilities differ, they only peak once at the start and once at the end. And this is a very slow process indeed, so I may have to look at RNNLIB as you suggested.

Anjlee

anjle...@hotmail.co.uk

Nov 14, 2019, 11:02:54 AM
to madmom-users
Hi Sebastian, 

I hope you're well. I've been keeping an eye out for the code you mentioned you were releasing; have you released it yet? And will it be on GitHub?

Thank you,

Anjlee

Tomáš Suchánek

Oct 23, 2020, 10:29:08 AM
to madmom-users

Hi,

sorry for "breaking" into your conversation but I thought that my question could relate to this topic.
I'm also trying to replicate the BLSTM type beat tracker and have some issues with training the model in Keras.
Since it is more of mine inability to work with Keras rather than a question on madmom, can I ask you Anjlee
for exact email adress, if you don't mind, so I can ask you few questions on fitting the model in Keras?
I believe it will be matter of just a few minutes and I would be very grateful. 

Best regards
Tom
On Thursday, 14 November 2019 at 17:02:54 UTC+1, anjle...@hotmail.co.uk wrote:

Sebastian Böck

Oct 26, 2020, 4:26:30 AM
to madmom-users
As said earlier, I suggest checking whether peephole connections are used (they make a difference), or using TCNs (temporal convolutional networks), since they are much easier to train and perform on par with BLSTMs (with peephole connections).
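
In Keras terms the TCN idea looks roughly like this; just a sketch of a stack of dilated convolutions, not the published architecture, with all sizes being placeholders:

from tensorflow.keras import layers, Model

def build_tcn(n_features, n_filters=16, dilations=(1, 2, 4, 8, 16)):
    # stack of non-causal dilated 1D convolutions over the feature frames
    inp = layers.Input(shape=(None, n_features))
    x = inp
    for d in dilations:
        x = layers.Conv1D(n_filters, kernel_size=5, padding='same',
                          dilation_rate=d, activation='elu')(x)
        x = layers.SpatialDropout1D(0.1)(x)
    # frame-wise beat activation function
    out = layers.Dense(1, activation='sigmoid')(x)
    return Model(inp, out)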

HTH

Tomáš Suchánek

Oct 26, 2020, 9:10:35 AM
to madmom-users
Thanks for the reply! I intend to get into the TCN implementation later, as soon as I get the BLSTM one right, because when I try
to make the model predict some values after training, I get only numbers very close to zero, as if there were no beats
happening at all. So I guess there's some little bug in the code, which I'd like to talk about with someone who has also tried this approach.
I think I've got the label order right, just like Anjlee wrote a year ago, and the network overfits right from the 3rd or 4th epoch.

On Monday, 26 October 2020 at 9:26:30 UTC+1, Sebastian Böck wrote: