online onset RNN detection


Laubeee

unread,
Mar 27, 2019, 1:24:04 PM3/27/19
to madmom-users
Hi, I am having trouble using the "online" mode of RNN onset detection:


import numpy as np
from madmom.features.onsets import CNNOnsetProcessor, RNNOnsetProcessor

ocnn = CNNOnsetProcessor()
act = ocnn(signal)

ornn = RNNOnsetProcessor()
act2 = ornn(signal)

online_ornn = RNNOnsetProcessor(online=True, origin='online')
act3 = np.zeros(len(act))
for i in range(len(act)):
    act3[i] = online_ornn(signal[i * 441:(i + 1) * 441], reset=False)[0]



act and act2 look good, but act3 has a max value of 0.0009...
How is this online mode supposed to work? Do I misunderstand the term "online"? I thought it means processing data in "real time" (as opposed to "offline"), i.e. as soon as the next batch of samples arrives I want to know whether an onset happened. How can I do that?

Cheers

Sebastian Böck

unread,
Apr 1, 2019, 3:12:24 AM4/1/19
to madmom-users
Hi, sorry for the late reply.

Basically that's the right way, except for a minor detail: you're passing a rather short frame of only 441 samples. Try changing it to
signal[i * 441:i * 441 + 2048]
and it should work.
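For illustration, here's a numpy-only sketch of the corrected slicing (the signal and the loop count are made up, and the madmom call itself is left out — this only shows how the chunks overlap):

```python
import numpy as np

# stand-in for the actual audio: 1 second of noise at 44.1 kHz mono
signal = np.random.randn(44100).astype(np.float32)

frame_size = 2048  # samples handed to the processor per call
hop_size = 441     # advance per call (100 fps at 44.1 kHz)

frames = []
for i in range(10):
    # corrected slicing: each chunk starts at i * hop_size but spans a
    # full frame_size samples, so consecutive chunks overlap
    frames.append(signal[i * hop_size:i * hop_size + frame_size])

# every chunk has the full frame size ...
assert all(len(f) == frame_size for f in frames)
# ... and consecutive chunks share frame_size - hop_size samples
assert np.array_equal(frames[1][:frame_size - hop_size], frames[0][hop_size:])
```

The point is that the slice start still advances by the hop size, but each slice is long enough to fill a whole analysis frame.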

You could also use the FramedSignal class instead; it takes care of all the signal slicing with the correct origin and other things, and it's easier to use because you can simply iterate over it.

HTH


Silvan Laube

unread,
Apr 1, 2019, 5:55:12 AM4/1/19
to madmom-users
Oh, okay, thanks!
I thought, since the processors return one value for every 441 samples, that this has to be the signal size and that they use internal buffering with "reset=False" or something...
Speaking of which: what does reset=False actually do? I see it prevents clearing some kind of buffer, but following your example I need to pass a full frame each time anyway (instead of "just" the hop_size new samples), so I wonder what is actually being buffered there..?

anyhow I will play with all the parameters and try to figure what gives me best results.

On the FramedSignal class: the detector is meant to work within a realtime environment where I won't have the luxury of having all samples in advance, so right now it seemed best to call it directly with the raw samples. I saw there is also the possibility to use an audio stream directly, which I am thinking about trying as well: does it have any advantage, or does it do just the same, i.e. wait for the next n samples and feed them to the detector?

kind regards

Sebastian Böck

unread,
Apr 1, 2019, 6:11:33 AM4/1/19
to madmom-users
Setting reset=False prevents the RNN from losing its internal state.

The Stream class does what you're looking for. It opens a pyaudio stream, is also iterable, blocks until hop_size new samples are present, and does all the buffering to retrieve frame_size samples.
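A rough numpy-only emulation of that buffering (assuming, as with a blocking pyaudio read, that the source delivers hop_size new samples per iteration — this is a hypothetical simplification, not madmom's actual implementation):

```python
import numpy as np

def framed_stream(sample_source, frame_size=2048, hop_size=441):
    """Rough emulation of the buffering madmom's Stream performs (minus
    pyaudio): each iteration takes hop_size new samples and yields the
    most recent frame_size samples, zero-padded at the beginning."""
    buffer = np.zeros(frame_size, dtype=np.float32)
    for chunk in sample_source:  # each chunk = hop_size new samples
        # slide the buffer: append the new samples, keep the newest frame
        buffer = np.concatenate((buffer, chunk))[-frame_size:]
        yield buffer

# feed 10 hops of a ramp signal, hop_size samples at a time
signal = np.arange(4410, dtype=np.float32)
frames = list(framed_stream(signal.reshape(10, 441), 2048, 441))
assert all(len(f) == 2048 for f in frames)               # always a full frame
assert np.array_equal(frames[-1][-441:], signal[-441:])  # newest samples last
```

Each yielded frame could then be fed to a processor with reset=False, exactly like the sliced chunks above.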

Silvan Laube

unread,
Apr 8, 2019, 7:39:38 AM4/8/19
to madmom-users
Hmm, I still struggle a bit with using the ODF/peak-picking functions correctly in a realtime environment. There are several problems:
- realtime processing yields different results: feeding the complete signal at once differs from feeding it "streamed" (in chunks of hop_size)
- it seems to me SpectralOnsetProcessor does not save any internal state and therefore calculates the features way too often, making it notably slower
- the closest match I got by calculating the onset detection function in a way that resembles the result of feeding the whole signal at once as closely as possible, which resulted in this code:

fs = 2048
hs = 512
odf = SpectralOnsetProcessor('superflux', filterbank=LogFB, num_bands=36, log=np.log10, sample_rate=w.rate, frame_size=fs, hop_size=hs)
peak_sf = OnsetPeakPickingProcessor(fps=w.rate/hs, threshold=0.75, pre_max=0, combine=0.15, online=True, pre_avg=0)
print(peak_sf(odf(signal)).round(3))

o = peak_sf(odf(signal[0:fs]), reset=True)
offset = int(fs/hs) + 1  # 5
index = offset - 2  # 3
onsets = []
for i in range(int(signal.shape[0]/512) - 5):  # takes ~3.24s
    st = i*hs/48000
    f = odf(signal[i*hs:(i+offset)*hs])[index:index+1]
    onsets += peak_sf(f, reset=False).round(3).tolist()
print(onsets)

Output:
[1.739, 2.816, 3.819, 4.875, 5.973, 7.029, 8.096, 9.227, 10.304, 11.552, 12.811, 14.752, 15.403]
[1.781, 2.859, 3.861, 4.917, 6.016, 7.072, 8.139, 9.269, 10.347, 11.595, 12.853, 14.795, 15.445]

As you can see, the "streamed" version is always ~42 ms off.
My annotated onsets lie between those two, but much closer to the first ones (maybe 5-10 ms later).
So I guess this is OK for my use case, but I still wonder whether there is a better way of doing this, so that streamed processing matches the "non-streamed" result without slowing down? It should be possible, right?

Cheers
Silvan Laube

Sebastian Böck

unread,
Apr 8, 2019, 7:53:10 AM4/8/19
to madmom-users
Unfortunately, SpectralOnsetProcessor is not ready for streaming yet. The whole features.onsets module would need an overhaul to be streaming/online capable, so there's a good chance that the results are wrong or take unnecessarily long to compute.

Regarding the offset of the detections: When operating in 'online' mode, FramedSignal defines the origin of the signal differently. It considers only past information, i.e. reference sample 0 contains only 0s as input signal. It thus takes a couple of frames until enough audio information is available to be processed. If I read your code correctly, you read the audio signal "to the future", which is exactly the opposite direction.
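A numpy-only sketch of that framing direction (the helper below is a made-up simplification for illustration, not madmom's actual implementation):

```python
import numpy as np

signal = np.arange(8, dtype=float)
frame_size = 4

def online_frame(signal, ref, frame_size):
    # hypothetical simplification of FramedSignal(..., origin='online'):
    # the frame for reference sample `ref` ends at `ref`, i.e. it holds
    # only past samples and is zero-padded at the very beginning
    padded = np.concatenate((np.zeros(frame_size), signal))
    return padded[ref:ref + frame_size]

# reference sample 0 sees nothing but zeros ...
assert np.all(online_frame(signal, 0, frame_size) == 0)
# ... and only after frame_size samples is the frame fully populated
assert np.array_equal(online_frame(signal, frame_size, frame_size),
                      signal[:frame_size])
# whereas slicing "to the future", as in the loop above, reads samples
# that would not have arrived yet in a realtime setting:
assert np.array_equal(signal[0:frame_size], [0., 1., 2., 3.])
```

That difference of roughly one frame of look-ahead is consistent with the constant ~42 ms shift between the two onset lists.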