On Wednesday, 16 October 2019 15:22:27 UTC+2,
anjle...@hotmail.co.uk wrote:
I have created a bidirectional LSTM using keras and the SMC MIREX dataset, but I have been having a few issues with my network which may be due to my feature extraction and my post-processing using the madmom package. I was wondering if you could help by answering a few questions about the articles:
- During feature extraction we create a feature vector from 6 different types of spectrograms, but the first article also mentions filter banks. Does this have any relation to the MFCCs described here: https://musicinformationretrieval.com/mfcc.html
Yes and no. MFCCs are Mel-frequency cepstral coefficients, so first a Mel-filtered spectrogram is computed. This is also what I used back then as input to the neural network. No cepstral coefficients (i.e. no additional DCT applied to the spectrogram) are used, though.
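If it helps, here is a minimal sketch of computing such a filtered, log-magnitude spectrogram with madmom. The frame size, frame rate and number of bands are assumptions, not the exact settings from the papers:

```python
from madmom.audio.spectrogram import LogarithmicFilteredSpectrogram

# Log-magnitude, filterbank-reduced spectrogram as network input.
# frame_size, fps and num_bands are assumptions, adjust to your setup.
spec = LogarithmicFilteredSpectrogram('some_file.wav', frame_size=2048,
                                      fps=100, num_bands=12)
print(spec.shape)  # (num_frames, num_filter_bins)
```

For multiple spectrogram types (e.g. different frame sizes), you would compute several of these and stack them along the feature axis.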
- For the beat target vectors I am reading in the annotations from the dataset, converting the timestamps to frames, and then creating target vectors of 0s and 1s with the same length as the number of frames in each audio file. Is this the correct method, and am I correct in using binary cross-entropy with a sigmoid dense layer in my network?
- The accuracy of my network seems to be good: 0.9860 after just 25 epochs. However, my loss seems quite high, starting at 14.7 and decreasing to 8.5, and by the looks of things there is no over- or underfitting in comparison to the validation set. Is this normal for the network?
Yes, this is to be expected. However, the accuracy values are basically meaningless, because the only thing they tell you is that you classify 98.6% of all frames correctly. I guess that the remaining 1.4% are mostly the frames with a target value of 1. So you must either compute the accuracy only for these frames or ignore it altogether. In my experience it is enough to monitor the cross-entropy loss.
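For illustration, a minimal numpy sketch of both steps, i.e. building the frame-level target vector from the annotated beat times and then evaluating only the frames with a target of 1; the frame rate and all names are assumptions:

```python
import numpy as np

FPS = 100  # frames per second, must match the feature extraction (assumption)

def beats_to_targets(beat_times, num_frames, fps=FPS):
    """Convert annotated beat times [seconds] to a binary frame-level target vector."""
    targets = np.zeros(num_frames, dtype=np.float32)
    idx = np.round(np.asarray(beat_times) * fps).astype(int)
    idx = idx[(idx >= 0) & (idx < num_frames)]
    targets[idx] = 1.
    return targets

def beat_frame_accuracy(targets, activations, threshold=0.5):
    """Accuracy computed only on the frames annotated as beats (target == 1)."""
    beat_frames = targets == 1
    if not np.any(beat_frames):
        return np.nan
    return float(np.mean(activations[beat_frames] >= threshold))
```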
- Finally, I have attempted to use madmom 0.16.1 to compute a tempo estimation, but I keep receiving the same tempo for every sound file when using the histogram-based method.
This is a bit hard to believe...
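Just as a sanity check, this is roughly how the histogram-based tempo estimation is usually run on madmom's own beat activation function. If this already gives you different tempi per file, the problem is probably in how your own activations are scaled or shaped (just a guess):

```python
from madmom.features.beats import RNNBeatProcessor
from madmom.features.tempo import TempoEstimationProcessor

# Beat activation function, one value per frame at 100 fps.
act = RNNBeatProcessor()('some_file.wav')

# Histogram-based tempo estimation; method='acf' selects the
# autocorrelation histogram, the default is a comb filter histogram.
proc = TempoEstimationProcessor(method='acf', fps=100)
tempi = proc(act)
print(tempi)  # rows of (tempo in BPM, relative strength)
```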
- I have also attempted to follow the first paper and use the autocorrelation method, specifically section 3.3. However, the threshold function `madmom.features.beats.threshold_activations(activations, threshold)` does not seem to exist in the package. Also, whenever I pass my output vector to NumPy's or Librosa's autocorrelation function, I get a straight line.
No, there is no function with that name. But why do you think there should be? Although some algorithms included in madmom are based on these early works, almost everything has changed in the meantime.
However, the development version of madmom has an `interval_histogram_acf()` function which basically does what you're trying to accomplish: computing a tempo histogram from a beat activation function.
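A rough sketch of how it could be plugged together with your own activation function; the BPM range and frame rate are assumptions:

```python
import numpy as np
from madmom.features.tempo import interval_histogram_acf, detect_tempo

fps = 100  # frame rate of your activation function (assumption)
act = np.asarray(act, dtype=float)  # your per-frame beat activations, 1-D

# Autocorrelation histogram over beat intervals corresponding to
# roughly 40..250 BPM (assumption).
histogram = interval_histogram_acf(act, min_tau=int(60. * fps / 250),
                                   max_tau=int(60. * fps / 40))
tempi = detect_tempo(histogram, fps)
print(tempi[:2])  # strongest tempo candidates as (BPM, strength)
```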
I am not quite sure how to replicate the results from the articles, or how to use the tempo estimation functions with the output of my own neural network. Could you please point me in the right direction?
I'd try to replicate newer works, since the older they are, the more difficult it is to get the exact same data.
However, the biggest problem might be that keras (or tensorflow) has an incomplete LSTM implementation which lacks peephole connections, but these seem to be quite important for accurate timing of the events. I once tried to reproduce my own beat tracking results and was not able to do so. I gave up on this quite quickly though, because keras/TF was not only performing worse, it was also ~25x slower than RNNLIB using only a single CPU core.
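If you want to experiment with peephole connections in keras anyway, some TensorFlow 2.x releases ship an experimental peephole LSTM cell; whether it is still available depends on your TF version, so treat this purely as a sketch:

```python
import tensorflow as tf

# Sketch only: tf.keras.experimental.PeepholeLSTMCell was available in some
# TensorFlow 2.x releases and was deprecated/removed later, so check your version.
num_features, num_units = 120, 25  # assumptions

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, num_features)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.RNN(tf.keras.experimental.PeepholeLSTMCell(num_units),
                            return_sequences=True),
        backward_layer=tf.keras.layers.RNN(
            tf.keras.experimental.PeepholeLSTMCell(num_units),
            return_sequences=True, go_backwards=True)),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # per-frame beat activation
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```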
Just a heads up: ISMIR is taking place in roughly two weeks, and I will release the code and data for our newest multi-task beat tracking and tempo estimation system. It has basically the same beat tracking performance as the BLSTM approaches, but is much faster to train.
HTH