scipy.signal.stft to librosa.core.stft

499 views
Skip to first unread message

Alexander Botev

unread,
Jun 25, 2017, 10:49:12 AM6/25/17
to librosa
Hi,
I'm trying to translate the scipy.signal.stft to librosa.core.stft.
However, so far I'm failing quite miserbly. Please keep in mind I have not worked before in signal processing and I might be overlooking something very obvious.
The scipy stft doc - https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.signal.stft.html
I have the following example, trying to get same result in librosa:

fs = 10e3
N = 1e5
amp = 2 * np.sqrt(2)
noise_power = 0.01 * fs / 2
time = np.arange(N) / float(fs)
mod = 500*np.cos(2*np.pi*0.25*time)
carrier = amp * np.sin(2*np.pi*3e3*time + mod)
noise = np.random.normal(scale=np.sqrt(noise_power),
                          size=time.shape)
noise *= np.exp(-time/5)
x = carrier + noise

# Scipy
nperseg = 256
nfft = nperseg
noverlap = nperseg // 2
f, t, Zxx1 = signal.stft(x, fs,
                        nperseg=nperseg,
                        noverlap=noverlap,
                        nfft=nfft)
print(f.shape, t.shape, Zxx1.shape)
plt.pcolormesh(t, f, np.abs(Zxx1), vmin=0, vmax=amp)
plt.title('STFT Magnitude')
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()

# Librosa
Zxx2 = librosa.core.stft(x,
                        win_length=nperseg,
                        hop_length=nfft - noverlap,
                        n_fft=nfft)
print(f.shape, t.shape, Zxx2.shape)
plt.pcolormesh(t, f, np.abs(Zxx2), vmin=0, vmax=amp)
plt.title('STFT Magnitude')
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()

First, even this example has a small difference. The scipy result is  (129, 783) while the librosa is (129, 783)
However, the bigger issue is that the results are significantly different from each other in the sense that
print(Zxx1[0, :5], Zxx2[0, :5]) gives:
[-0.14494054+0.j -0.38635139+0.j -0.04261852+0.j -0.27211995+0.j -0.72294187+0.j]
[-30.86872292-0.j -49.45297623-0.j -5.45517063-0.j -34.83135223-0.j -92.53656006-0.j]

As you can see the librosa values are significantly higher. However, I did not fully understand how to translate other arguments between the two,
for instance the boundary in scipy can be ['even', 'odd', 'constant', 'zeros', None] while librosa has pad_mode='reflect' but there is no documentation on what other options there are.

I just need some help on understanding this.

Alexander Botev

unread,
Jun 25, 2017, 6:23:05 PM6/25/17
to librosa


So just to clarify I'm interested in working with the LibriASR corpus, however I want to make sure what I'm doing makes sense.

Below I show code which I use to compare the Mel Spectrograms achieved in different way.
However, they are not the same, thus I was wondering if someone could explain me why they are not the same so I'm sure what I will use later is fine.

def plot_folder(ID, BOOK, window_size=2048, overlap=512):
    folder = FOLDER.format(ID, BOOK)
    transcript = TRANSCRIPT.format(ID, BOOK)
    with open(os.path.join(folder, transcript), encoding="utf-8") as file:
        text = file.readlines()
    for line in text:
        line = line.strip()
        sep = line.index(" ") + 1
        words = line[sep + 1:].lower()
        file_name = os.path.join(folder, line[:sep-1] + ".flac")
        sr = 16000
        data, sr = librosa.core.load(file_name,sr=sr)

       
        nperseg = 256
        nfft = nperseg
        noverlap = nperseg // 2
       
        # Direct Mel Spectrogram
        mfcc0 = librosa.feature.melspectrogram(data, sr,
                                               hop_length=nfft - noverlap,
                                               n_fft=nfft)
        mfcc0 = librosa.logamplitude(mfcc0, ref_power=np.max)
       
        # Mel Spectrogram scipy's spectrogram
        f, t, sxx = spectrogram(data, sr,
                                nperseg=nperseg,
                                noverlap=noverlap,
                                nfft=nfft)
        mfcc1 = librosa.feature.melspectrogram(data, sr, S=sxx)
        mfcc1 = librosa.logamplitude(mfcc1, ref_power=np.max)
       
        # Mel Spectrogram using librosa stft
        sxx2 = librosa.core.stft(data,
                                 win_length=nperseg,
                                 hop_length=nfft - noverlap,
                                 n_fft=nfft)
        mfcc2 = librosa.feature.melspectrogram(data, sr, S=sxx2)
        mfcc2 = librosa.logamplitude(mfcc2, ref_power=np.max)
        print(mfcc0.shape, mfcc1.shape, mfcc2.shape)
       
        plt.figure(figsize=(15,8))
        plt.subplot(3, 1, 1)
        plt.imshow(mfcc0)
        plt.title("Direct Mel Spectrogram")
        plt.subplot(3, 1, 2)
        plt.imshow(mfcc1)
        plt.title("Mel Spectrogram via scipy")
        plt.subplot(3, 1, 3)
        plt.imshow(mfcc2)
        plt.title("Mel Spectrogram via stft")
        plt.show()

An example output is shown in the figure.

Also are those horizontal line standard effects?

Dan Ellis

unread,
Jun 26, 2017, 9:27:44 AM6/26/17
to Alexander Botev, librosa
The difference between spicy.stft and librosa.stft looks to be simple scaling:

>>> a = np.array([-0.14494054+0.j, -0.38635139+0.j, -0.04261852+0.j, -0.27211995+0.j, -0.72294187+0.j])

>>> b = np.array([-30.86872292-0.j, -49.45297623-0.j, -5.45517063-0.j, -34.83135223-0.j, -92.53656006-0.j])

>>> print b/a

[ 212.97507875-0.j  127.99999563-0.j  128.00000164-0.j  127.99999497-0.j

  128.00000097-0.j]


There's a 1/N factor that has to be inserted somewhere in the STFT/ISTFT cycle, but it's a matter of convention where it gets put in.  Looks like the two implementations made different choices.  (And there's another factor of 2 somewhere, perhaps in the window scaling).

[I'm confused that the result is pure real since I don't see how the first window is going to be even, which this implies.]

The larger difference in the d.c. component I can't immediately explain, butI wouldn't worry about it.  The match in the next 4 values convinces me that they are calculating the same fourier transform in both cases.

  DAn.

--
You received this message because you are subscribed to the Google Groups "librosa" group.
To unsubscribe from this group and stop receiving emails from it, send an email to librosa+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/librosa/f1bd9e10-1067-434f-bc3d-587588e94a4a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Ellis

unread,
Jun 26, 2017, 10:12:53 AM6/26/17
to Alexander Botev, librosa
The mel spectra all look pretty similar to me, modulo colormap scaling; try setting vmin=-60.0 in each of the imshow calls, or adding a colorbar to each to reveal any differences in range.

But that striping in the low frequencies is really bad.  Somehow you're being given too many mel bands (so some of your mel bands are "falling into the gaps" in the lower spectrum); perhaps your nfft is smaller than the routine expects?

I can't help pointing out that your spectrograms are upside-down (to my eyes).  Use origin="bottom" if you must use imshow, or use librosa.display.specshow.

  DAn.

--
You received this message because you are subscribed to the Google Groups "librosa" group.
To unsubscribe from this group and stop receiving emails from it, send an email to librosa+unsubscribe@googlegroups.com.

Alexander Botev

unread,
Jun 26, 2017, 12:08:11 PM6/26/17
to librosa, bot...@gmail.com, dp...@ee.columbia.edu
Hmm, so I guess that might be a bit suspicously looking from my example, but trying `print(Zxx2 / Zxx1[0, :-1])` gives:

[[ 307.35359669 -0.00000000e+00j  127.99999612 -0.00000000e+00j
   128.00000309 -0.00000000e+00j ...,  128.00000615 +0.00000000e+00j
   127.99999886 +0.00000000e+00j  246.25243487 +0.00000000e+00j]
 [-107.21650777 -8.86480832e-14j  -74.31984309 +1.11443954e+02j
   -89.04640530 -7.53675603e+00j ...,  105.58323792 +1.18760732e+02j
   -57.50047188 -2.43688958e+00j -163.00358643 -1.46036493e+02j]
 [-133.67438165 -4.22209770e-14j   94.01754556 -2.00440348e+02j
    24.97926125 -3.44702747e+01j ..., -208.76517328 +3.93691628e+00j
    -2.84837747 +3.26670652e+01j  -51.55900936 +2.97177201e+02j]
 ..., 
 [ 292.19523512 +1.54592899e-14j -101.73281910 +2.60337865e+02j
  -102.91160333 +7.05384684e+01j ..., -178.73944503 -1.15004031e+02j
    27.02306390 +1.14095236e+02j  -14.74832880 +3.54251055e+02j]
 [ 137.31903199 -1.83986588e-14j   55.92114137 -1.20411546e+02j
    95.50501294 -2.00852097e+01j ...,  454.94744829 +4.76787857e+01j
     4.59906273 +1.40259281e+02j -310.24637069 -1.67919984e+02j]
 [ -95.82788060 +0.00000000e+00j  -47.57751836 +0.00000000e+00j
   -84.15896540 +0.00000000e+00j ..., -611.69060637 -0.00000000e+00j
    94.74219291 +0.00000000e+00j  620.18834661 +0.00000000e+00j]]

So it seems that the scaling you descirbe is true only for the first window of the STFT, but is different on all others.
To unsubscribe from this group and stop receiving emails from it, send an email to librosa+u...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages