Augmenting mel scale spectrograms with ZCR

Pulkit

Jun 27, 2021, 2:43:59 AM
to librosa
I am trying to implement a part of this paper: https://people.kth.se/~ghe/pubs/pdf/szekely2019casting.pdf

Input is a large audio file (~50 minutes). In the paper, they have a 22 kHz (sampling rate) audio file from which they extracted mel spectrograms using librosa with a window width of 20 ms and a 2.5 ms hop length. The resulting spectrograms cover two seconds of audio each and have 128×800 pixels. They then augment the frequency dimension with ZCR information.
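For my own sanity check, this is how I read those settings in samples (my arithmetic, not code from the paper):

```python
# Converting the paper's window/hop durations into samples
# (my own arithmetic, not from the paper's code).
sr_paper = 22050
win = round(0.020 * sr_paper)    # 20 ms window -> 441 samples
hop = round(0.0025 * sr_paper)   # 2.5 ms hop   -> 55 samples
frames = round(2.0 / 0.0025)     # 2 s at one frame per 2.5 ms -> 800 frames
print(win, hop, frames)          # 441 55 800

# The same durations at my 44.1 kHz material:
print(round(0.020 * 44100), round(0.0025 * 44100))  # 882 110
```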

I have a 44.1 kHz (sampling rate) audio file for which I have written this code:

import librosa
sr = 44100
frame_length = 4096 
hop_length = 1024
stream = librosa.stream('final.wav', block_length=128, frame_length=frame_length, hop_length=hop_length)

mel_specs_log = []
zcr = []
for y in stream:
    mel = librosa.feature.melspectrogram(y, sr=sr)
    log_mel = librosa.power_to_db(mel)
    mel_specs_log.append(log_mel)
    zcr.append(librosa.feature.zero_crossing_rate(y))

Can someone please help me with
1. checking if my parameters are okay for 44k audio
2. how to add ZCR as another image channel to each spectrogram window

If someone can help me with this, I'd really be grateful.

Thanks!

Pulkit

Jun 27, 2021, 3:34:46 AM
to librosa
Updated code for the same:


import numpy as np
import librosa
sr = 44100
frame_length = 4096 
hop_length = 1024
stream = librosa.stream('final.wav', block_length=128, frame_length=frame_length, hop_length=hop_length)

mel_specs_log_zcr = []
for y in stream:
    mel = librosa.feature.melspectrogram(y, sr=sr)
    log_mel = librosa.power_to_db(mel)
    zcr = librosa.feature.zero_crossing_rate(y)
    zcr = np.tile(zcr, (128, 1))
    mel_spec_log_zcr = np.concatenate((log_mel, zcr), axis=1)
    mel_specs_log_zcr.append(mel_spec_log_zcr)

Can someone please verify if this is correct and if not point me in the right direction?

Thanks

Brian McFee

Jun 28, 2021, 1:25:26 PM
to librosa
Responses inline below:

On Sunday, June 27, 2021 at 3:34:46 AM UTC-4 Pulkit wrote:
Updated code for the same :


import numpy as np
import librosa
sr = 44100
frame_length = 4096 
hop_length = 1024
stream = librosa.stream('final.wav', block_length=128, frame_length=frame_length, hop_length=hop_length)

mel_specs_log_zcr = []
for y in stream:
    mel = librosa.feature.melspectrogram(y, sr=sr)

If you're going to use block streams, you probably should be using left-aligned analysis (i.e., center=False) to ensure consistent results.  Keeping center=True here introduces some padding effects and discontinuities at the beginning and end of each block.

You also need to propagate frame length and hop length here.
 
    log_mel = librosa.power_to_db(mel)
    zcr = librosa.feature.zero_crossing_rate(y)

Frame length and hop length should be included here as well.
 
    zcr = np.tile(zcr, (128, 1))
    mel_spec_log_zcr = np.concatenate((log_mel, zcr), axis=1)

I'm not sure this will work the way that you want it to, but you never really specified how you want these features to be concatenated.

axis=1 here refers to the time (frame) dimension, not the frequency (feature) dimension which is axis=0.  This is why you're tiling the 1-dimensional feature zcr 128 times, but I don't think this is correct either.
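A toy example with hypothetical shapes makes the axis distinction concrete:

```python
import numpy as np

# Hypothetical shapes: 128 mel bins x 800 frames, plus one ZCR row per frame.
log_mel = np.zeros((128, 800))
zcr = np.zeros((1, 800))

# axis=0 stacks along the feature (frequency) dimension:
by_freq = np.concatenate((log_mel, zcr), axis=0)
print(by_freq.shape)  # (129, 800)

# axis=1 appends along the time (frame) dimension instead:
by_time = np.concatenate((log_mel, np.zeros((128, 100))), axis=1)
print(by_time.shape)  # (128, 900)
```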

I think a better way to do it would be to not tile zcr, and instead do something like:


zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop_length, frame_length=frame_length)
mel_spec_log_zcr = np.concatenate((log_mel, zcr), axis=0)

Pulkit

Jun 29, 2021, 1:27:14 PM
to librosa
Thanks for your response, Brian. I realize I did not make several things clear; that's totally my bad. So here is what I am trying to do:

I am trying to implement this paper: https://people.kth.se/~ghe/pubs/pdf/szekely2019casting.pdf (it trains a speaker-specific breath detection model that is used to clean a corpus for a TTS system).
In the paper they use librosa to extract log-magnitude mel spectrograms from audio, which, after being augmented with ZCR values, are used to train a neural network. By augmenting with ZCR values they mean adding the ZCR values as a separate channel to the extracted mel spectrograms. They do the following in the paper:
1. Extract log-magnitude mel spectrograms from raw audio with a window width of 20 ms and a 2.5 ms hop length. The resulting spectrograms for two seconds of audio are supposed to have 128×800 pixels.
2. The spectrogram images are encoded monochromatically.
3. The images are augmented with the zero-crossing rate (ZCR) of each window, by adding the ZCRs as another image channel to each spectrogram window (Figure 2 in the paper makes this clear).

Now, their data is sampled at 48 kHz, while mine is sampled at 44.1 kHz. Besides, I am fairly new to speech processing.
So this is what I could come up with after your suggestions:

import numpy as np
import librosa
sr = 44100
frame_length = 4096 
hop_length = 1024
stream = librosa.stream('final.wav', block_length=128, frame_length=frame_length, hop_length=hop_length)

mel_specs_log_zcr = []
for y in stream:
    mel = librosa.feature.melspectrogram(y, sr=sr, center=False, n_fft=frame_length, hop_length=hop_length)
    log_mel = librosa.power_to_db(mel)
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop_length, frame_length=frame_length, center=False)
    mel_spec_log_zcr = np.concatenate((log_mel, zcr), axis=0)
    mel_specs_log_zcr.append(mel_spec_log_zcr)


I'd really be grateful if you or anyone else could look at my code and let me know whether it correctly implements the paper as I have described above. If not, please suggest edits.

Thanks again.

Brian McFee

Jun 30, 2021, 12:10:48 PM
to librosa
Oh, I see.  I figured you intended to concatenate the zcr feature as another channel, resulting in a (128+1)x800 array. 

Replicating the zcr feature 128 times seems a bit wasteful to me, but if that's what you want, then the output should be of shape 2x128x800 (or some permutation thereof).  The tiling you originally had should be fine here.  You'll want to use np.stack instead of concatenate so that a new axis is created, but otherwise things look like they should work.
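Something like this, using the paper's 128×800 as example shapes (the ZCR row tiled across the mel rows, then stacked as a second channel):

```python
import numpy as np

log_mel = np.random.rand(128, 800)      # example log-mel "image"
zcr = np.random.rand(1, 800)            # one ZCR value per frame
zcr_img = np.tile(zcr, (128, 1))        # replicate down the 128 mel rows
mel_zcr = np.stack((log_mel, zcr_img))  # np.stack creates a new channel axis
print(mel_zcr.shape)  # (2, 128, 800)
```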