Monophonic Audio to MIDI


Casper Leerink

Jan 9, 2019, 6:48:32 PM1/9/19
to Magenta Discuss
Hi Everyone,

I am new to Magenta and am following the CADL course on Kadenze. I was wondering: is there a monophonic audio-to-MIDI model that I can train on my own dataset? I could only find a polyphonic version for piano.

Also, if there is none, does anyone know where to start, and what kind of network would work best for this?

So the input would be audio (speech, < 20 seconds) and the output a MIDI transcription of that audio.

I will have time to build my own model, but I don't know where to start. Should I look into Onsets and Frames, or does polyphonic MIDI transcription require a different set of layers?

Thank you, Casper

Ian Simon

Jan 9, 2019, 7:01:30 PM1/9/19
to Casper Leerink, Magenta Discuss
If you want discrete notes as output it's a trickier problem, but if you'd be satisfied with continuously-varying frequency there's a lot of free pitch tracking software out there (not even ML-based for the most part).

A few examples:

For many instruments it would probably be pretty straightforward to extract discrete note events from the pitch track; there may even be free software that does this but I'm not aware of it.  For vocals it's harder.
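To give a rough idea of what "extract discrete note events from the pitch track" might look like: quantize each frame to the nearest MIDI pitch and merge runs of identical pitches into notes. A minimal numpy sketch (the function names, hop size, and minimum-duration threshold are my own choices, and the frame-wise f0 array, with NaN for unvoiced frames, is assumed to come from whatever pitch tracker you use):

```python
import numpy as np

def hz_to_midi(f0):
    """Convert frequency in Hz to (fractional) MIDI pitch."""
    return 69.0 + 12.0 * np.log2(f0 / 440.0)

def pitch_track_to_notes(f0, hop_seconds=0.01, min_duration=0.05):
    """Segment a frame-wise pitch track (Hz, NaN = unvoiced) into
    discrete note events as (midi_pitch, onset_sec, offset_sec)."""
    notes = []
    current_pitch = None  # pitch of the note currently open, or None
    onset = 0.0
    for i, f in enumerate(f0):
        t = i * hop_seconds
        pitch = None if np.isnan(f) else int(round(hz_to_midi(f)))
        if pitch != current_pitch:
            # close the previous note if it was long enough to keep
            if current_pitch is not None and t - onset >= min_duration:
                notes.append((current_pitch, onset, t))
            current_pitch = pitch
            onset = t
    # close the final note at the end of the track
    t_end = len(f0) * hop_seconds
    if current_pitch is not None and t_end - onset >= min_duration:
        notes.append((current_pitch, onset, t_end))
    return notes
```

For vocals this naive quantization would be fragile (vibrato and glides straddle semitone boundaries), which is presumably why Ian flags the vocal case as harder.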

Of course, if you're dealing with monophonic piano (or something vaguely piano-sounding) you could try using Onsets and Frames as is...

-Ian

--
Magenta project: magenta.tensorflow.org

Casper Leerink

Jan 9, 2019, 7:08:47 PM1/9/19
to Magenta Discuss, casper...@gmail.com
Thank you Ian,

Yes, I want the notes. I was already looking into continuously-varying frequency together with onset and offset detection, but I think that for the audio I want to work with I need machine learning: speech has a lot of pitch variation that classic pitch detectors pick up but that isn't of interest to me. So I would have to let the model learn which variations to detect and which ones aren't heard by humans.

So another idea is to use this continuous pitch data, together with the energy (loudness) at each datapoint, as input and output the notes as MIDI. (But I don't know whether that works better or worse than using the raw audio as input; I guess I can try both.)
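The pitch-plus-energy input described above could be laid out frame-wise like this. A minimal numpy sketch (sample rate, frame and hop sizes are arbitrary, and the f0 column is assumed to come from any pitch tracker, e.g. pYIN or CREPE, run at the same hop):

```python
import numpy as np

def frame_energy(audio, frame_len=400, hop=160):
    """Frame-wise log energy of a mono signal (e.g. 25 ms frames,
    10 ms hop at 16 kHz)."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        energy[i] = np.log(np.mean(frame ** 2) + 1e-8)  # floor avoids log(0)
    return energy

def make_input(f0_hz, energy):
    """Stack pitch (Hz, NaN -> 0 for unvoiced frames) and log energy
    into an (n_frames, 2) feature matrix for a note-prediction model."""
    pitch = np.nan_to_num(f0_hz, nan=0.0)
    return np.stack([pitch, energy], axis=1)
```

A model trained on this two-column input has far less to learn than one fed raw audio, at the cost of inheriting any mistakes the pitch tracker makes, which is roughly the trade-off between the two ideas in the message above.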

Ian Simon

Jan 9, 2019, 7:18:43 PM1/9/19
to Casper Leerink, Magenta Discuss
Oh, I remember hearing about something called imitone that supposedly converts vocals to MIDI melody. I've never tried it myself, though, and it's not free; maybe someone else on the list has experience with it.

-Ian