strange scaling of slaney mels


Graham Coleman

Jan 3, 2024, 9:47:45 AM
to librosa
Greetings librosa,

Recently I was working with the mel_frequencies and hz_to_mel functions, and I noticed something that looked off. My intention was to use HTK-style mels, but I left off the parameter somewhere, and I noticed it radically changed the scale of the output mels. 

From what I understood, there should be only small differences in the conversion formulas, whether using HTK or Slaney mel conversions. But as one sees in the figure and output below, the Slaney mel frequencies only get up to the double digits.

I have attached the script, as well as a figure comparing the two.

Am I seeing this correctly? I am working on Windows 10, and pip reports that I have librosa 0.10.1.

Here are the arrays printed for Slaney, hz and mels:
hz_slaney: [   0.           60.10899199  120.21798398  180.32697597  240.43596796
  300.54495995  360.65395194  420.76294394  480.87193593  540.98092792
  601.08991991  661.1989119   721.30790389  781.41689588  841.52588787
  901.63487986  961.74387185 1022.79221042 1088.20042872 1157.79154456
 1231.83305693 1310.60957153 1394.42389481 1483.59819785 1578.4752548
 1679.41976044 1786.819732   1901.08800068 2022.66379848 2152.01444657
 2289.63715163 2436.06091701 2591.84857616 2757.59895611 2933.94917923
 3121.5771123  3321.20397212 3533.59709775 3759.57290008 4000.        ]
mel_slaney: [ 0.          0.90163488  1.80326976  2.70490464  3.60653952  4.5081744
  5.40980928  6.31144416  7.21307904  8.11471392  9.0163488   9.91798368
 10.81961856 11.72125344 12.62288832 13.5245232  14.42615808 15.32779296
 16.22942784 17.13106272 18.0326976  18.93433248 19.83596736 20.73760224
 21.63923712 22.540872   23.44250688 24.34414176 25.24577664 26.14741152
 27.0490464  27.95068128 28.85231616 29.75395104 30.65558592 31.5572208
 32.45885568 33.36049055 34.26212543 35.16376031]

best regards,

Graham

plot_mel.pdf
plot_mel.py

Brian McFee

Jan 12, 2024, 2:35:37 PM
to librosa
Short answer: I think this looks correct.

Longer answer: the key difference between HTK and Slaney is that HTK uses a fully logarithmic mapping between frequency and mel number, while Slaney is logarithmic only above 1 kHz and linear below. If you were to plot the frequencies directly, i.e., without using the mel number as the horizontal axis, you would see that the two curves are pretty close above the 1 kHz threshold.

The plot you're making here is a little confusing precisely because of the conversion from Hz to "mel number", which is not directly comparable between the Slaney and HTK definitions. The hz_to_mel function is mainly there as a helper and complement to mel_to_hz; both are used internally by the mel_frequencies() function.
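For readers who want to see the two scales side by side, here is a minimal sketch in plain Python. It is not librosa's code: the function names are made up for illustration, the Slaney branch follows the piecewise formula quoted later in this thread, and the HTK branch assumes the commonly cited formula 2595 * log10(1 + f / 700). The helper mel_spaced_hz is only meant to imitate the idea behind mel_frequencies() (space points evenly in mel, then map back to Hz).

```python
import math

def hz_to_mel_htk(f):
    """Assumed HTK-style conversion: one log formula over the whole range."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz_htk(m):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_mel_slaney(f):
    """Slaney-style conversion: linear below 1 kHz, logarithmic above."""
    if f < 1000.0:
        return 3.0 * f / 200.0
    return 15.0 + 27.0 * math.log(f / 1000.0) / math.log(6.4)

def mel_to_hz_slaney(m):
    """Inverse of hz_to_mel_slaney (mel 15 corresponds to 1 kHz)."""
    if m < 15.0:
        return 200.0 * m / 3.0
    return 1000.0 * 6.4 ** ((m - 15.0) / 27.0)

def mel_spaced_hz(n, fmin, fmax, to_mel, to_hz):
    """Return n frequencies in Hz that are evenly spaced on a mel scale."""
    lo, hi = to_mel(fmin), to_mel(fmax)
    return [to_hz(lo + (hi - lo) * i / (n - 1)) for i in range(n)]

slaney_hz = mel_spaced_hz(40, 0.0, 4000.0, hz_to_mel_slaney, mel_to_hz_slaney)
htk_hz = mel_spaced_hz(40, 0.0, 4000.0, hz_to_mel_htk, mel_to_hz_htk)

# The mel numbers live on very different scales...
print("Slaney mel at 4 kHz:", hz_to_mel_slaney(4000.0))
print("HTK mel at 4 kHz:   ", hz_to_mel_htk(4000.0))

# ...but the resulting frequency grids can be compared directly in Hz.
for s, h in zip(slaney_hz[::6], htk_hz[::6]):
    print(f"{s:9.2f} Hz (Slaney)   {h:9.2f} Hz (HTK)")
```

The Slaney values reproduce the arrays printed in the first message (top mel value near 35.16, second frequency near 60.11 Hz), which suggests the small mel numbers are a property of the scale's units and not a bug.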

Graham Coleman

Jan 30, 2024, 1:08:56 PM
to librosa
Dear Brian,

Thank you for your confirmation. I came to a similar conclusion after seeing the plot in the documentation for Matlab's hz2mel function. It uses parallel y-axes to show the different scaling of O'Shaughnessy vs. Slaney mel units.

It seems that as long as your Hz filter centers have the property of a certain relative spacing in mel space, the scale of mel units could be arbitrary. Thinking about this helped me remove unnecessary detail from a model I am working on.

Still, I am curious where the explicit conversion from Hz to Slaney mels originates, as I am unable to find it in either the Auditory Toolbox code or its technical report. The technical report is cited for the conversion formula found on Wikipedia, which seems to correspond to the implementation in Rastamat and librosa.

m(f) = \begin{cases}
\frac{3f}{200} & f < 1000 \\
15 + 27 \log_{6.4} \left(\frac{f}{1000}\right) &f\geq 1000
\end{cases}

For example, the code below for computing filter bank centers comes from the Auditory Toolbox. I interpret it as only implicitly distributing the filters in mel space, without showing any explicit conversion.

best regards,

Graham

%   Filter bank parameters
lowestFrequency = 133.3333;
linearFilters = 13;
linearSpacing = 66.66666666;
logFilters = 27;
logSpacing = 1.0711703;
fftSize = 512;
cepstralCoefficients = 13;
windowSize = 400;
windowSize = 256;       % Standard says 400, but 256 makes more sense
                % Really should be a function of the sample
                % rate (and the lowestFrequency) and the
                % frame rate.
if (nargin < 2) samplingRate = 16000; end;
if (nargin < 3) frameRate = 100; end;

% Keep this around for later....
totalFilters = linearFilters + logFilters;

% Now figure the band edges.  Interesting frequencies are spaced
% by linearSpacing for a while, then go logarithmic.  First figure
% all the interesting frequencies.  Lower, center, and upper band
% edges are all consecutive interesting frequencies.

freqs = lowestFrequency + (0:linearFilters-1)*linearSpacing;
freqs(linearFilters+1:totalFilters+2) = ...
              freqs(linearFilters) * logSpacing.^(1:logFilters+2);

lower = freqs(1:totalFilters);
center = freqs(2:totalFilters+1);
upper = freqs(3:totalFilters+2);
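For readers more comfortable with Python, the band-edge computation above can be transcribed roughly as follows. This is a line-by-line sketch of the Matlab excerpt only, with the parameter values copied from it; the snake_case names are my own and the rest of the original function is not reproduced.

```python
# Filter bank parameters, copied from the Matlab excerpt above.
lowest_frequency = 133.3333
linear_filters = 13
linear_spacing = 66.66666666
log_filters = 27
log_spacing = 1.0711703

total_filters = linear_filters + log_filters

# Matlab: freqs = lowestFrequency + (0:linearFilters-1)*linearSpacing;
freqs = [lowest_frequency + k * linear_spacing for k in range(linear_filters)]

# Matlab: freqs(linearFilters+1:totalFilters+2) = ...
#             freqs(linearFilters) * logSpacing.^(1:logFilters+2);
last_linear = freqs[-1]
freqs += [last_linear * log_spacing ** k for k in range(1, log_filters + 3)]

# Lower, center, and upper band edges are consecutive entries of freqs
# (Matlab indices are 1-based, Python slices are 0-based).
lower = freqs[:total_filters]
center = freqs[1:total_filters + 1]
upper = freqs[2:total_filters + 2]

print("last linear frequency:", freqs[linear_filters - 1])
print("first log frequency:  ", freqs[linear_filters])
print("number of centers:    ", len(center))
```

The first logarithmically spaced frequency lands just below 1000 Hz, which is where the linear and logarithmic regions meet.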


Brian McFee

Jan 30, 2024, 4:04:33 PM
to librosa
This would have come from the initial port of the rastamat code: https://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/hz2mel.m

The way librosa developed, we initially translated rastamat (and related) scripts into Python, and then refactored and generalized from there.

Dan Ellis

Jan 30, 2024, 7:04:19 PM
to Brian McFee, librosa
I believe the "implicit" definition in the Auditory Toolbox is equivalent to the more explicit formula in librosa (I may be responsible for the translation).

In Matlab, the first block of freqs goes from 133.33 Hz up to the break frequency at 133.33 + 13 * 66.6667 = 1000 Hz.
Then the remaining frequencies are exponentially spaced, with each frequency 1.0711703 times larger than its predecessor (which preserves the step size at the boundary),
i.e., 1/1.0711703 ≈ (1000 - 66.6667)/1000.

Obviously, this isn't a coincidence.  Maybe Malcolm Slaney knows where he got those numbers from in the beginning.
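The boundary arithmetic can be checked numerically with a few lines of plain Python. The constants are the ones from the Matlab excerpt; the comparison with 6.4 ** (1 / 27) is my own observation, based on the 27 and 6.4 that appear in the explicit formula quoted earlier in the thread, and is not something the toolbox code states.

```python
log_spacing = 1.0711703        # logSpacing in the Matlab code
linear_spacing = 66.66666666   # linearSpacing in the Matlab code
break_hz = 1000.0              # break frequency between the two regions

# One logarithmic step below the break frequency...
one_log_step_down = break_hz / log_spacing
# ...versus one linear step below it.
one_linear_step_down = break_hz - linear_spacing

print("one log step below 1 kHz:   ", one_log_step_down)
print("one linear step below 1 kHz:", one_linear_step_down)

# 27 logarithmic steps multiply the frequency by about 6.4, which would
# explain the constants in 15 + 27 * log_6.4(f / 1000).
print("log_spacing ** 27:", log_spacing ** 27)
print("6.4 ** (1 / 27):  ", 6.4 ** (1.0 / 27.0))
```

The two "one step down" values agree to within a fraction of a hertz, so the step size is nearly continuous across the 1 kHz boundary.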

  DAn.

Graham Coleman

Feb 5, 2024, 5:17:50 AM
to librosa, Brian McFee, Dan Ellis
Dear Brian and Dan,

Thank you for the additional background of the implementations as well as the explicit conversion formula.

After spending a bit more time with the hz2mel formula, I think I understand it better. Though this should be near-obvious, it seems to map each of the 40 filters (given by parameters of the original Slaney code) to a unit interval in mel space. That is, the formula seems to be an inverse to the implicit mapping of indices to center frequencies in Hz.

This differs from the O'Shaughnessy/HTK approach, where 0 Hz is mapped to 0 mels and 1000 Hz is mapped to approximately 1000 mels.
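The unit-interval observation can be illustrated with a short sketch in plain Python. It rebuilds the Auditory Toolbox band edges from the parameters quoted earlier in the thread and passes them through the piecewise Hz-to-mel formula; the function and variable names are only for illustration and are not librosa's API.

```python
import math

def hz_to_mel_slaney(f):
    """Piecewise formula from earlier in the thread (linear, then log)."""
    if f < 1000.0:
        return 3.0 * f / 200.0
    return 15.0 + 27.0 * math.log(f / 1000.0) / math.log(6.4)

# Band edges as in the Auditory Toolbox excerpt: 13 linearly spaced
# frequencies followed by 29 logarithmically spaced ones.
lowest_frequency = 133.3333
linear_spacing = 66.66666666
log_spacing = 1.0711703
freqs = [lowest_frequency + k * linear_spacing for k in range(13)]
freqs += [freqs[12] * log_spacing ** k for k in range(1, 30)]

mels = [hz_to_mel_slaney(f) for f in freqs]
steps = [b - a for a, b in zip(mels, mels[1:])]

print("first few mel values:", [round(m, 3) for m in mels[:4]])
print("smallest step:", min(steps))
print("largest step: ", max(steps))
```

Every step between consecutive band edges comes out very close to one mel, with the largest deviation at the junction between the linear and logarithmic regions.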

best regards,
Graham
