I think the original poster is misinterpreting the dimensions of the output produced by melspectrogram.
melspectrogram, like all of the other feature extraction functions in librosa, produces a set of values for each frame of audio. By convention, the frame index is the last dimension of the result, and the number of channels/values per frame is the first dimension. So a mel spectrogram output with shape (128, 177) means you have 177 frames of audio, each with 128 mel channel values.
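For example, you can confirm the convention by inspecting the shape directly. A minimal sketch, assuming librosa >= 0.8 for the bundled example clip (any mono signal works the same way):

import librosa

y, sr = librosa.load(librosa.example('trumpet'))  # demo clip; substitute your own audio
melspec = librosa.feature.melspectrogram(y=y, sr=sr)
print(melspec.shape)  # (n_mels, n_frames): 128 mel bands by default, one column per frame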
If you want to feed an individual mel spectral frame into a neural network (or any other kind of model), you need to slice along the second axis (the frame axis), like:
melspec = librosa.feature.melspectrogram(y=y, sr=sr)  # y, sr as in the snippet above
observed_frame = melspec[:, 10]  # slice the 11th frame (index 10) out of the mel spectrogram
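If you'd rather hand a model one feature vector per row (a batch of frames), transposing flips the axes; this is plain numpy indexing, nothing librosa-specific:

frames = melspec.T       # shape (n_frames, n_mels): one 128-dimensional vector per frame
print(frames[10].shape)  # (128,) -- the same vector as observed_frame above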
Does that clear things up?