Superbock's work on CNN based onset detection


James Donnelly

Mar 5, 2016, 3:25:05 PM
to madmom-users
Hi,

I don't see any posts on the list.

I read your paper (with Jan Schlüter):

IMPROVED MUSICAL ONSET DETECTION WITH CONVOLUTIONAL NEURAL NETWORKS

I was wondering if any of the techniques mentioned will become part of the Madmom code at some point, if not already.

I have just managed to get Madmom installed, so I can now begin to study what onset detection capabilities it has.

I am very interested in NN based onset and tempo detection.

Cheers,

James



James Donnelly

Mar 5, 2016, 4:36:06 PM
to madmom-users
I just found the Docs, oops:


I notice in the API docs you have a submodule dedicated to RNN, but not one for CNN.

Is it fair to say that RNN is adequate for MIR from monophonic sources and that CNN provides an improved performance for polyphonic material?

If so, would it be true to say that CNN methods would not yield an improved result for monophonic signals?

I am not an academic or even a Python developer, but I have been developing software for 15 years.  My goal is to prototype some processing workflows which will give a really nice tempo map of monophonic sources, with the ability to produce a midi output from an audio input.

As I'm new to MIR, I don't know how tempo mapping is done, but I would think you could model underlying tempo variation as a separate thing from local note deviation.  I would think that these would feed into the produced map with different weights, but just speculating.

Ultimately I would anticipate training the network as part of an ongoing project, but I have no clue to what extent any of the trained networks you have are converged for this kind of task.

Now that I see from the docs that your library has utilities for interfacing with MIDI, it seems to me it's all there.  But Python and signal processing are outside the areas I've worked in commercially (I have written a real-time graphical spectrum analyser, so I know a bit about Nyquist and FFT, and have also written a successful captcha-breaking prototype using feed-forward neural networks.  These were both just for fun).

I now plan to work through the API docs and try to implement some stuff, but if you did have any sample projects you think might help that are in the public domain, I would be very grateful.

James Donnelly

Mar 5, 2016, 5:02:17 PM
to madmom-users
It seems I'm filling up your mailing list by myself.  Sorry about that.

Your PyPI page doesn't seem to mention the readthedocs pages, hence my initial failure.


I should have noticed, however, that the above page does mention that the bin folder is FULL of amazing sample applications.  Great!

Maybe worth noting that this fact is missing from the readthedocs pages.

If I'm wrong, oops again, but just trying to help.

Sebastian Böck

Mar 6, 2016, 12:34:56 AM
to madmom-users
Hi,


On Saturday, 5 March 2016 23:02:17 UTC+1, James Donnelly wrote:
Your PyPI page doesn't seem to mention the readthedocs pages, hence my initial failure.


There will be a new release soon. It will include Python 3 support, some new code, and an updated PyPI readme with a link to the docs.

Sebastian Böck

Mar 6, 2016, 12:50:19 AM
to madmom-users
Hi,

On Saturday, 5 March 2016 22:36:06 UTC+1, James Donnelly wrote:
I notice in the API docs you have a submodule dedicated to RNN, but not one for CNN.

The neural network code will be reworked in the next couple of weeks to include not only RNNs but also other network types. We hope to get the CNN stuff in as well, but this is a bit trickier to do with numpy only at reasonable speed.
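
To give an idea of why: a naive 2D convolution in pure numpy looks roughly like the sketch below (illustration only, not madmom code), and the nested Python loops are exactly what makes this slow unless everything is vectorised or moved to Cython.

import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D convolution (strictly, cross-correlation) of input x with kernel k."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply the current patch with the kernel and sum up
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# e.g. a 3x3 kernel applied to a small spectrogram excerpt (bins x frames)
spec = np.random.rand(80, 15)
kernel = np.random.rand(3, 3)
print(conv2d_valid(spec, kernel).shape)  # (78, 13)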
 
Is it fair to say that RNN is adequate for MIR from monophonic sources and that CNN provides an improved performance for polyphonic material?
If so, would it be true to say that CNN methods would not yield an improved result for monophonic signals?

No, I wouldn't say so. RNNs can deal with polyphonic input as well. The question of RNNs vs. CNNs is not easy to answer and both technologies have their strengths and weaknesses. CNNs are great if the temporal context is somehow limited (as for onset detection) and RNNs are great if the context is longer (e.g. beat tracking). Both can do both, but for some tasks one or the other might be better suited.
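
From the outside, using either network type should look much the same. A rough sketch with the processor names we have in mind (these may well change with the rework, and the CNN one is not in the codebase yet; 'input.wav' is just a placeholder):

from madmom.features.onsets import RNNOnsetProcessor, OnsetPeakPickingProcessor

# frame-wise onset activation function (100 frames per second) from the RNN
act = RNNOnsetProcessor()('input.wav')

# a CNNOnsetProcessor would be used the same way once the CNN models are included

# peak-picking turns the activation function into onset times in seconds
onsets = OnsetPeakPickingProcessor(threshold=0.5, fps=100)(act)
print(onsets)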
 
I am not an academic or even a Python developer, but I have been developing software for 15 years.  My goal is to prototype some processing workflows which will give a really nice tempo map of monophonic sources, with the ability to produce a midi output from an audio input.

As I'm new to MIR, I don't know how tempo mapping is done, but I would think you could model underlying tempo variation as a separate thing from local note deviation.  I would think that these would feed into the produced map with different weights, but just speculating.

Traditionally, autocorrelation, FFT, and comb filters are used for tempo estimation. Which one you choose depends on your input, but the differences between them might be rather small compared to the impact of your features.
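
As a toy example, autocorrelation-based tempo estimation on a frame-wise activation function boils down to something like this (sketch only):

import numpy as np

def tempo_from_autocorrelation(activation, fps=100, min_bpm=40, max_bpm=250):
    """Estimate the dominant tempo (BPM) of a frame-wise activation function."""
    # autocorrelation, keeping only the non-negative lags
    ac = np.correlate(activation, activation, mode='full')[len(activation) - 1:]
    # translate the BPM range into a lag range (in frames)
    min_lag = int(60. * fps / max_bpm)
    max_lag = int(60. * fps / min_bpm)
    # pick the lag with the strongest correlation and convert it back to BPM
    lag = min_lag + np.argmax(ac[min_lag:max_lag])
    return 60. * fps / lag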

We found that it helps a lot to have a musically meaningful input to the system. E.g. we quite successfully used the output of an RNN trained to predict beat positions as input to a bank of resonating comb filters.
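
In terms of the current code, that combination looks roughly like this (a sketch; the processor names are those in the present API and may differ in older or future versions):

from madmom.features.beats import RNNBeatProcessor
from madmom.features.tempo import TempoEstimationProcessor

# beat activation function from the RNN (100 frames per second)
act = RNNBeatProcessor()('input.wav')

# bank of resonating comb filters on top of the activation function
tempo = TempoEstimationProcessor(method='comb', min_bpm=40, max_bpm=250, fps=100)
print(tempo(act))  # rows of (tempo in BPM, relative strength)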
 
Ultimately I would anticipate training the network as part of an ongoing project, but I have no clue to what extent any of the trained networks you have are converged for this kind of task.

You can try whether the included networks produce some meaningful output. The models are in /madmom/models, and examples of how to use them are in /bin (as you discovered already).

James Donnelly

Mar 6, 2016, 5:02:49 AM
to madmom-users
Thank you for your reply.



On Sunday, 6 March 2016 05:50:19 UTC, Sebastian Böck wrote:

We found that it helps a lot to have a musically meaningful input to the system. E.g. we quite successfully used the output of an RNN trained to predict beat positions as input to a bank of resonating comb filters.


I found your paper on this one. I will study it further, but this work, along with other insights I have gained from looking at the code and sample applications, seems to indicate that the primary tempo detection goal is to measure the dominant periodicity.

The area I am interested in is modelling tempo that changes dynamically over time.  The traditional music sequencer recording modality is to have the musician record performances along to a click track.  Because in many musical forms variations in tempo exist organically, the idea is to produce a map consisting of tempo change markers over time.

This will facilitate a recording model where 'natural' performances, i.e. ones that do not rigidly conform to a fixed tempo, can be captured 'as is' into the sequencer, and a tempo map used to quantise other tracks to the performance.

I was thinking that the source music could perhaps be divided into segments, perhaps with downbeat detection, and a tempo estimate obtained for each segment, informed both by onsets detected within it and by the wider tempo context of the piece derived from all segments.

There are a number of mainstream products that can do rudimentary tempo mapping and beat detection, but I feel that the state-of-the-art NN-driven approach that you have helped to pioneer has yet to find its way into practical day-to-day use.


You can try whether the included networks produce some meaningful output. The models are in /madmom/models, and examples of how to use them are in /bin (as you discovered already).

I have been able to experiment with some of the sample applications, particularly PianoTranscriptor and TempoDetector.  With PianoTranscriptor I went straight for a recording of Beethoven's Piano Sonata No. 1 in F minor, Op. 2 No. 1, Allegro, a very challenging piece, I would have thought.  The results were not perfect, but it was a revelation how well the sample code did, and it has provided some encouragement for further experiments on polyphonic material.
 

Sebastian Böck

Mar 6, 2016, 8:00:07 AM
to madmom-users

Hi,

On Sunday, 6 March 2016 11:02:49 UTC+1, James Donnelly wrote:
I found your paper on this one. I will study it further, but this work, along with other insights I have gained from looking at the code and sample applications, seems to indicate that the primary tempo detection goal is to measure the dominant periodicity.

Yes, that's correct.
 
The area I am interested in is modelling tempo that changes dynamically over time.  The traditional music sequencer recording modality is to have the musician record performances along to a click track.  Because in many musical forms variations in tempo exist organically, the idea is to produce a map consisting of tempo change markers over time.

So what you basically want is beat tracking.

This will facilitate a recording model where 'natural' performances, i.e. ones that do not rigidly conform to a fixed tempo, can be captured 'as is' into the sequencer, and a tempo map used to quantise other tracks to the performance.

I'm not sure if I understand what you mean by tempo map, but I guess that the beat intervals can provide that. You might need to smooth them somehow.
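
Roughly along these lines (a quick sketch):

import numpy as np
from scipy.ndimage import median_filter

def tempo_curve(beat_times, smooth=5):
    """Turn beat times (in seconds) into a smoothed per-interval tempo curve (BPM)."""
    intervals = np.diff(beat_times)          # inter-beat intervals in seconds
    bpm = 60. / intervals                    # instantaneous tempo per interval
    return median_filter(bpm, size=smooth)   # smooth out local timing deviations

# beat_times would come from the beat tracking output
print(tempo_curve(np.array([0.52, 1.01, 1.50, 2.01, 2.49, 3.00, 3.52]), smooth=3))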

I was thinking that the source music could perhaps be divided into segments, perhaps with downbeat detection

A bird told me that someone is working on downbeat detection, so stay tuned ;)

James Donnelly

Mar 6, 2016, 3:50:54 PM
to madmom-users
Thanks for your pointers; I think I am starting to understand the practicalities now.  As for the maths, well, let's say that is a way off.

On Sunday, 6 March 2016 13:00:07 UTC, Sebastian Böck wrote:

So what you basically want is beat tracking.

Well, I just learned what beat tracking is...I told you I was new to this :)

Yes, tempo detection was a 'red herring' as we say in the UK.  The tempo map can be derived from the output of the beat detection. 

Early signs are that it's producing a perfect tempo map from my test piece of guitar strumming.


This will facilitate a recording model where 'natural' performances, i.e. ones that do not rigidly conform to a fixed tempo, can be captured 'as is' into the sequencer, and a tempo map used to quantise other tracks to the performance.

I'm not sure if I understand what you mean by tempo map, but I guess that the beat intervals can provide that. You might need to smooth them somehow.

In music sequencing terms, I believe the term tempo map is generally understood to mean a series of tempo change markers, usually inserted at regular musical intervals, often every 4 or 8 bars.  Once the sequencer has these tempo change markers in place, the grid of the sequencer's arrange window has effectively been adapted to the musical progression of the recording.  As I mentioned, this allows other tracks to be quantised to the grid present in the variable-tempo recording.

The form a tempo map would take for exchange between software tools could, for instance, be an otherwise empty MIDI file with the tempo changes inserted at regular intervals.
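
Something along the lines of this rough sketch (using the mido library as an example; I haven't looked at madmom's own MIDI utilities yet):

import mido

def write_tempo_map(tempo_changes, outfile, ticks_per_beat=480):
    """Write an otherwise empty MIDI file containing only tempo change events.

    tempo_changes: list of (beat_position, bpm) pairs, beat positions in quarter notes.
    """
    mid = mido.MidiFile(ticks_per_beat=ticks_per_beat)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    last_tick = 0
    for beat, bpm in tempo_changes:
        tick = int(round(beat * ticks_per_beat))
        track.append(mido.MetaMessage('set_tempo',
                                      tempo=mido.bpm2tempo(bpm),
                                      time=tick - last_tick))
        last_tick = tick
    mid.save(outfile)

# e.g. a tempo change every 4 bars of 4/4 (16 beats)
write_tempo_map([(0, 120.0), (16, 118.5), (32, 121.2)], 'tempo_map.mid')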

The smoothing you speak of is what I was struggling to explain concisely when talking about the "wider tempo context" for evaluating onsets, but I now understand we're talking about detected beats, not onsets.



I was thinking that the source music could perhaps be divided into segments, perhaps with downbeat detection

A bird told me that someone is working on downbeat detection, so stay tuned ;)

Apparently I won't need this as beat tracking will do it as you say, but I'll stay tuned anyway.

Some feedback on OnsetDetection.py (actually I tested the LL one): it worked first time, detecting the onsets (rhythmic strum points) from the test guitar piece, something that has been impossible for me to do in a single pass with a transient detection algorithm.