Paper for povey window

zlin

Apr 1, 2019, 6:19:28 AM
to kaldi-help
Excuse me,
Is there a paper about the Povey window function?
I am writing a paper, and the Povey window is used in my feature extraction,
so I want to cite a paper about the Povey window,
but I couldn't find one.
Could you please share it with me?
Thank you.

Daniel Povey

Apr 1, 2019, 12:56:59 PM
to kaldi-help
There isn't really a paper. I suppose you can refer to the codebase or the Kaldi paper (?)
Anyway it's not rocket science.  Calling it the "Povey window" was more a joke than anything else; it's a window function that looks mostly like the Hamming window but which goes to zero smoothly at the edges (to stop the high-energy low-frequency components bleeding into the high-frequency parts of the spectrum).  I just couldn't bring myself to use a window function which has a discontinuity at the edges, as it seems wrong for this application.

You can see the definition in the code.
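
For readers who do not have the source open, here is a minimal sketch of that definition. It assumes the "povey" window in Kaldi's feature-window code is a Hann window raised to the power 0.85; the function name `povey_window` and the pure-Python form are illustrative, not Kaldi's own API, so check the codebase before citing it.

```python
import math

def povey_window(n_points, power=0.85):
    """Hann window raised to `power` (assumed form of Kaldi's 'povey' window)."""
    if n_points < 2:
        raise ValueError("need at least 2 points")
    a = 2.0 * math.pi / (n_points - 1)
    # max(0.0, ...) guards against a tiny negative base from rounding,
    # which would make the fractional power complex in Python.
    return [max(0.0, 0.5 - 0.5 * math.cos(a * i)) ** power
            for i in range(n_points)]

w = povey_window(400)  # e.g. a 25 ms frame at 16 kHz
print("first:", w[0], "last:", w[-1], "peak:", max(w))
```

With a power of 1.0 this reduces to the plain Hann window, which makes it easy to compare the two shapes.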

Dan

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/90b9baa5-480d-44dc-a65a-20d3bdf49ff6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Omid Sadjadi

Dec 2, 2021, 6:38:32 PM
to kaldi-help
> Calling it the "Povey window" was more a joke than anything else;

This joke has unfortunately been taken so seriously that "Povey window" is now part of torchaudio.

As I wrote to you and Yenda in 2017, firstly, the window function described above is Hann (aka Hanning) not Hamming. Secondly, the Hanning window does taper off to zero at both edges. Concretely,

w(0) = 0.5 - 0.5 * cos(2 * pi * 0/(N-1)) = 0.0
w(N-1) = 0.5 - 0.5 * cos(2 * pi * (N-1)/(N-1)) = 0.0
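
The same endpoint check can be run numerically. This is a small sketch using the standard symmetric definitions (Hann: 0.5 - 0.5 cos, Hamming: 0.54 - 0.46 cos); the helper names and the choice N = 400 are arbitrary. It shows that Hann reaches zero at both edges, while Hamming stops at a pedestal of 0.08.

```python
import math

def hann(i, n):
    # Symmetric Hann (raised cosine): zero at i = 0 and i = n - 1.
    return 0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1))

def hamming(i, n):
    # Symmetric Hamming: same shape on a pedestal of 0.54 - 0.46 = 0.08.
    return 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))

N = 400
for name, fn in (("hann", hann), ("hamming", hamming)):
    print(name, "w(0) =", fn(0, N), " w(N-1) =", fn(N - 1, N))
```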

I'm also attaching the magnitude responses for Hamming, Hanning, and "Povey" windows. I hope people find it insightful.

Thanks again for your continuous support for Kaldi!
[Attachment: window_functions.png]

Dan Ellis

Feb 3, 2022, 5:49:56 PM
to kaldi-help
Some comments:

Short-time windowing is often interpreted in terms of its impact in the Fourier (frequency) domain; multiplying by a finite-length window in time is equivalent to convolving (i.e., blurring) the frequency response with the Fourier transform of the window - the magnitude responses shown in Omid's message.  

This is consistent with a more intuitive motivation for tapered windows like the raised-cosine (Hann): Smoothly tapering a waveform segment to zero at its edges avoids the discontinuity we'd otherwise likely get. Discontinuities in time correspond to spreading (blurring) spectral energy across the whole frequency range -- a/k/a "spectral splatter" -- which, if you listen to it, sounds like an audible click.

However, the smooth but finite-duration raised-cosine window does not have a monotonic magnitude spectrum; its broadened central "bump" is surrounded by multiple secondary bumps (sidelobes) separated by frequencies where the magnitude is zero (notches).  Sidelobes are unpleasant because they introduce local maxima in the windowed spectrum which are not centered on the frequency component that caused them.

Even the spectral splatter caused by a rectangular window (no tapering) has a sidelobe structure.  The largest sidelobe, at around 3pi/L rad/samp away from the center (for an L point window) is about 13 dB below the main lobe peak.

Hann-window tapering reduces this worst sidelobe peak to better than 31 dB below the main peak.  This comes, however, at a cost - the main lobe itself is twice as wide (twice the blurring), so the first sidelobe is now around 5pi/L rad/samp.

The Hamming window reintroduces a little bit of discontinuity (a "pedestal" below the raised-cosine) which manages to cancel some of the peak of the worst Hann sidelobe.  As a result, the Hamming window has a worst-case sidelobe almost 43 dB below the mainlobe.  It doesn't make the mainlobe any wider than with Hann, *but* it does introduce spectral splatter: whereas the Hann sidelobes continue to decrease as you get further away from the center frequency, the Hamming sidelobes die out much more slowly, as can be seen in Omid's figure.  
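
The worst-case sidelobe figures quoted above (about 13, 31, and 43 dB) can be checked with a short script. This is a rough sketch, not a reference implementation: it evaluates the DTFT of each window directly on a dense grid, walks down the main lobe to its first minimum, and reports the largest magnitude beyond it. The window length of 64, the grid size, and the 0.85 exponent for the "povey" entry are assumptions made for illustration.

```python
import cmath
import math

def make_window(kind, n):
    a = 2.0 * math.pi / (n - 1)
    if kind == "rectangular":
        return [1.0] * n
    if kind == "hann":
        return [0.5 - 0.5 * math.cos(a * i) for i in range(n)]
    if kind == "hamming":
        return [0.54 - 0.46 * math.cos(a * i) for i in range(n)]
    if kind == "povey":  # assumed definition: Hann raised to the power 0.85
        return [max(0.0, 0.5 - 0.5 * math.cos(a * i)) ** 0.85 for i in range(n)]
    raise ValueError(kind)

def magnitude_db(window, omegas):
    """|DTFT| of the window at each omega (rad/sample), in dB relative to DC."""
    dc = abs(sum(window))
    out = []
    for w in omegas:
        x = sum(v * cmath.exp(-1j * w * i) for i, v in enumerate(window))
        out.append(20.0 * math.log10(max(abs(x), 1e-300) / dc))
    return out

def peak_sidelobe_db(window, grid=2048):
    omegas = [math.pi * k / grid for k in range(grid + 1)]
    mag = magnitude_db(window, omegas)
    k = 0
    while k + 1 < len(mag) and mag[k + 1] <= mag[k]:
        k += 1  # walk down the main lobe to its first minimum
    return max(mag[k:])  # largest lobe beyond the main lobe

N = 64
psl = {kind: peak_sidelobe_db(make_window(kind, N))
       for kind in ("rectangular", "hann", "hamming", "povey")}
for kind, level in psl.items():
    print(f"{kind:12s} peak sidelobe: {level:7.1f} dB")
```

The exact numbers shift slightly with window length and with the symmetric versus periodic convention, so treat the output as a sanity check on the orders of magnitude rather than as definitive values.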

So essentially it's a compromise between the size of the sidelobes very close to the main lobe (which Hamming makes more than 10 dB better than Hann) and the sidelobes far away (which Hamming makes much worse, maybe 50 dB worse in Omid's figure).  This can be particularly damaging if you're trying to be sensitive to low-energy components in a signal with other, high-energy components which are far removed in frequency.  Speech without pre-emphasis fits this, with the low-frequency voicing often 40 dB+ more intense than the high frequency.  
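
To make the far-from-center leakage concrete, the sketch below windows a single strong low-frequency tone and measures how much energy shows up in a distant high-frequency band, relative to the tone's own peak. Everything specific here is an assumption chosen for illustration: the 400-sample frame (25 ms at 16 kHz), the tone at 0.013 cycles/sample (about 208 Hz at 16 kHz), the band from 0.30 to 0.40 cycles/sample, and the 0.85 exponent for the "povey" window.

```python
import cmath
import math

def make_window(kind, n):
    a = 2.0 * math.pi / (n - 1)
    if kind == "hann":
        return [0.5 - 0.5 * math.cos(a * i) for i in range(n)]
    if kind == "hamming":
        return [0.54 - 0.46 * math.cos(a * i) for i in range(n)]
    if kind == "povey":  # assumed definition: Hann raised to the power 0.85
        return [max(0.0, 0.5 - 0.5 * math.cos(a * i)) ** 0.85 for i in range(n)]
    raise ValueError(kind)

def spectrum_db(frame, freq):
    """Magnitude (dB) of the frame's DTFT at `freq` in cycles/sample."""
    x = sum(v * cmath.exp(-2j * math.pi * freq * i) for i, v in enumerate(frame))
    return 20.0 * math.log10(max(abs(x), 1e-300))

N = 400      # frame length in samples (hypothetical)
F0 = 0.013   # strong low-frequency tone, cycles/sample (hypothetical)
far_band = [0.30 + 0.10 * k / 200 for k in range(201)]

leak = {}
for kind in ("hann", "hamming", "povey"):
    w = make_window(kind, N)
    frame = [w[i] * math.cos(2.0 * math.pi * F0 * i) for i in range(N)]
    peak = spectrum_db(frame, F0)
    # Worst-case leakage into the distant band, relative to the tone's peak.
    leak[kind] = max(spectrum_db(frame, f) for f in far_band) - peak

for kind, level in leak.items():
    print(f"{kind:8s} far-band leakage: {level:7.1f} dB relative to the tone")
```

A weak high-frequency component sitting below the reported leakage level would be masked by the low-frequency tone, which is the situation described above for speech without pre-emphasis.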

Dan's objection to the widely-used Hamming stems, I think, from the small discontinuity due to the pedestal, and the consequent spectral splatter.  So the "Povey window" stays pretty close to the Hamming window in the time domain, except at the extremes, where it smoothly tapers to zero.  However, this doesn't manage to preserve the first-sidelobe suppression of the original Hamming, which actually *relied* on the spectral splatter to cancel the lobe.  Here's a plot of the detail of the magnitude responses for Rectangular, Hann, Hamming, and Povey windows, right around the mainlobe and early sidelobes:

[Attachment: window_mag_responses.png]
You can see how Hamming (green) has reduced the first sidelobe (around \omega=0.010) compared to Hann (orange), but the later green sidelobes don't decay much. 

Unfortunately, the Povey window (red) doesn't appear to give any advantage over Hann, except for a marginally narrower mainlobe (which surprised me).  On the whole, I think a plain raised-cosine (Hann) is a better choice, although I'd be surprised if there was any meaningful difference between them on a downstream task.

  DAn.

Daniel Povey

Feb 3, 2022, 10:06:21 PM
to kaldi-help
Interesting analysis.
Bear in mind that the main lobe being a bit narrower (due to the wider filter), giving more frequency discrimination,
might be part of the reason why Hamming/Povey perform a little better than Hann for ASR (IIRC this is what I saw,
although I'm not saying it's a big difference).
Part of my motivation for having the lobes decay faster was that if some parts of the spectrum are much
quieter than others (e.g. higher frequencies are quieter, say, if you didn't do pre-emphasis; or if you didn't do mean subtraction),
then leakage from distant parts of the spectrum might be a big concern.  This is less of an issue for radio, where nearby channels
would be expected to have about the same signal magnitude as distant channels.
I think I was planning to try removing pre-emphasis, which would make this reasoning more salient, but that never happened.

Dan
