Dear AV1 team,
Looking for an interesting problem in AV1 for a lone theorist, PVQ was a natural choice (are there any others?). Over the past few days I have found and tested a substantial improvement; yesterday I wrote it up and have just submitted it to arXiv: https://www.dropbox.com/s/jgghkeov1swu53x/pvq.pdf
The PVQ “natural choice” of radial projection between the L1 and L2 unit spheres (“just normalize”) seemed highly unlikely to be the optimal one – I suspected that a more subtle transformation should distribute the quantization points more uniformly, reducing MSE.
While finding the truly optimal projection seems an extremely difficult task, a simple parametrized family, the “power projection”:
x_i -> normalize((x_i^p)_i)
with an optimized parameter p (e.g. 1.3), allows for really large improvements – typically more than a 10% reduction in MSE for K > L/2. I see a way to improve it further with a family having more parameters – I will probably squeeze out a bit more there.
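To make this concrete, here is a minimal numpy sketch of the projection (the greedy rounding onto K pulses at the end is only my illustration of how it would sit inside a quantizer, not part of the proposal; p=1 recovers the plain radial projection):

    import numpy as np

    def power_pvq(x, K, p=1.3):
        # "power projection": raise magnitudes to the power p, project onto the
        # L1 sphere of radius K, round down, then add the missing pulses greedily
        # (signs restored at the end); assumes x is a nonzero vector
        s = np.sign(x)
        a = np.abs(x) ** p
        y = K * a / a.sum()           # sum(|y_i|) = K
        q = np.floor(y).astype(int)   # never more than K pulses in total
        while q.sum() < K:
            i = np.argmax(y - q)      # put the next pulse where the deficit is largest
            q[i] += 1
        return s * q

    x = np.random.randn(8)
    x /= np.linalg.norm(x)            # PVQ input: a unit L2 vector
    print(power_pvq(x, K=6))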
For K < L/2, PVQ has just terrible performance – it degenerates to something that can easily be beaten by different quantizers (I have found up to 3 dB of improvement)… but doing it right will need some further work – let me know whether it is worth the effort. What are the typical (L, K) used?
On a related topic, I think the major obstruction to using PVQ in video compression was said to be its extreme computational cost? If so, the main source is probably the costly combinatorial enumerative coding for the L1 sphere?
This should be repairable by replacing it with a fast, accurate entropy coder, which would also handle the issue of fractional bits when storing the final PVQ index (whose range is not bounded by a power of 2) – let me know if it is worth working on.
Alternatively, you could use the L-infinity sphere (instead of L1), which is much cheaper to encode (just the Z^L lattice) but leads to worse MSE (the L1 lattice is closer to the sphere packing).
And more generally, I wanted to ask for a summary of the current status of PVQ in AV1.
Best regards,
Jarek Duda
Interesting. Do you have some images of what the points in your best codebook look like for L=2 and L=3? One minor comment about your
intro... The Householder reflection part of PVQ only applies to
Daala/AV1 because we came up with it after Opus was frozen. It should be
applicable to audio too, but it would require careful application to
avoid "desynchronization" on rounding errors and on packet loss.
It all rests on the definition you use for "optimal". Every codebook is
optimal for some distribution of the unquantized data. Your implicit
assumption (as far as I can tell) is that the data we are quantizing is
Gaussian-distributed, which means that we want uniform codewords on the
surface of the L-sphere.
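As a quick sanity check of that statement (just the math, not code from either codec): i.i.d. Gaussian vectors, once normalized, land uniformly on the sphere, e.g. their longitude is uniform:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((100000, 3))
    u = x / np.linalg.norm(x, axis=1, keepdims=True)   # radial projection to the L2 sphere
    lon = np.arctan2(u[:, 1], u[:, 0])                 # longitude of each projected point
    print(np.histogram(lon, bins=8)[0])                # roughly equal counts per bin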
In the case of audio MDCT coefficients, as far as I can tell the
distribution is closer to being Laplacian. In addition, the regions in
coefficient space where the PVQ codebook has the highest density are the
ones where there are one or two coefficients that are much larger than
the others. That corresponds to tones -- for which we *do* want better
resolution because noise is more perceptible. So while the codebook is
far from being "optimal" in the sense most commonly considered in
coding theory, it's pretty good for audio.
In the case of video, IIRC (it's been a while) the non-predicted video
DCT coefficients were also Laplace-like, but I don't remember really
measuring with the prediction. For video, the main drawback I can see in your current formulation is computing the power itself. While it can easily be done with a lookup table, table lookups cannot be vectorized on most CPUs. This would not be much of an issue for an audio codec like Opus, but for video (where vectorization is critical), it would likely be a big deal.
> While finding the truly optimal projection seems an extremely difficult task,
> a simple parametrized family, the “power projection”:
>
> *x_i -> normalize((x_i^p)_i)*
Do you have a refinement step after the projection? Especially for
Daala/AV1 we place the last few "pulses" using RDO (see below about rate).
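Roughly, the refinement looks like the sketch below (plain MSE greedy over magnitudes, no rate term -- the RDO placement mentioned above adds a rate cost to the same metric; this is only an illustration, not the actual Daala/AV1 code):

    import numpy as np

    def greedy_pulses(x, q0, K):
        # add pulses one at a time, each time picking the position that maximizes
        # the correlation-to-energy ratio (x.y)^2 / (y.y); works on magnitudes,
        # signs are restored at the end
        ax = np.abs(x).astype(float)
        q = np.abs(q0).astype(float)
        xy = np.dot(ax, q)
        yy = np.dot(q, q)
        for _ in range(K - int(q.sum())):
            # a pulse at position i changes xy -> xy + ax[i] and yy -> yy + 2*q[i] + 1
            gain = (xy + ax) ** 2 / (yy + 2.0 * q + 1.0)
            i = int(np.argmax(gain))
            q[i] += 1.0
            xy += ax[i]
            yy += 2.0 * q[i] - 1.0    # q[i] already includes the new pulse
        return np.sign(x) * q

    x = np.random.randn(8)
    x /= np.linalg.norm(x)
    print(greedy_pulses(x, np.zeros(8), K=6))   # here: all 6 pulses placed greedily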
It's really all over the place depending on the encoding quality, the
frequency band, ... In Opus, the value of L ranges from 2 all the way
up to 176, with K ranging from 1 up to 256. When K would be too large,
or if the codebook would take more than 32 bits, we actually split the vector (before quantization) in two so that the indexing never needs more than 32 bits. The corresponding "bit depth" ranges from close to 0 up to around
8 bits/coefficient. For low bit depths, we apply a "spreading rotation"
to avoid the output sounding too tonal.
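For reference, the codebook size that drives the 32-bit split above is the usual PVQ count V(L, K), the number of integer vectors of dimension L with sum(|x_i|) = K; a small sketch:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def V(L, K):
        # standard PVQ codeword count; V(L, K) <= 2**32 is roughly the point up to
        # which a single 32-bit index works, beyond that the vector gets split
        if K == 0:
            return 1
        if L == 0:
            return 0
        return V(L - 1, K) + V(L, K - 1) + V(L - 1, K - 1)

    print(V(2, 3))              # sanity check: 12 codewords in Z^2 with |a| + |b| = 3
    print(V(16, 32) > 2**32)    # already far too large to index in 32 bits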
For video, L ranges from 8 to 128, while K is theoretically unbounded.
We tend to go up to about 8 bits/coefficient of depth, but < 1 bit/coefficient is more typical.
Actually, the enumeration is only done for Opus because we want
predictable rates and the coefficients we're coding together in a vector
tend to have similar variances. This is not true of video, where the
coefficient with the lowest frequency in the vector tends to have a much
higher variance than the coefficients with the highest frequency in that
band -- even though we already split the coefficients into many bands
(which we also do for audio). This means we had to figure out a way to code the PVQ codeword more efficiently, taking advantage of the difference in variance. The first scheme we used is described here:
https://jmvalin.ca/notes/pvq_encoding.pdf
but it was too slow. What we're currently using is a scheme where, knowing K and L, we split the vector (after quantization this time) in two, code how K is split between the two halves, and continue recursively. The distribution of how K is split is modeled so that we can learn the difference in variance.
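A rough illustration of that recursive structure (my sketch of the idea only, without the entropy coding itself and without the sign coding):

    def split_symbols(q):
        # recursively split the quantized vector in two and record how many of
        # its pulses fall in the left half; the real coder entropy-codes these
        # symbols with distributions adapted to the local (L, K)
        syms = []

        def rec(v):
            k = sum(abs(c) for c in v)
            if len(v) <= 1 or k == 0:
                return
            half = len(v) // 2
            k_left = sum(abs(c) for c in v[:half])
            syms.append((len(v), k, k_left))   # "k_left of the k pulses go left"
            rec(v[:half])
            rec(v[half:])

        rec(list(q))
        return syms

    print(split_symbols([3, 0, -1, 0, 1, 0, 0, 0]))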
Of course, the new coder is also far from perfect -- in part because it
requires many adaptation contexts: theoretically one for each (L, K) pair, though we take shortcuts for the larger values of K.
Of course when it comes to decoder complexity, there's also the issue
that we have to compute the forward DCT of the prediction (which you
don't need for scalar). On the encoder side, the main issue with
complexity is the search, and the fact that AV1 runs many, many codebook searches in its RDO loop (IIRC about 1000x more than Daala for the same image).
So MSE is generally fine even when you're optimizing the codec for
subjective quality because the "perceptual" part is usually handled at a
higher level, e.g. by setting K or by optimizing block size decisions.
The low-level codebook search itself is usually MSE-based.
As I said earlier, the important thing is the distribution of the data.
I think many sources are closer to being Laplace-distributed than
Gaussian-distributed, but there's nothing that beats real data. This is
especially true for video because it's not just the shape of the
distribution, but how the variance changes for different coefficients.
Even the best possible Laplace quantizer would perform significantly
worse than dumb scalar quantization if it doesn't take into account the
difference in variance (we've learned that the hard way).
You might actually have better luck with (low-order) polynomials. While
multiplications aren't cheap, at least they vectorize well. With 16-bit
precision, AVX2 will let you compute 16 multiplies per cycle, while
you'd only be able to perform one lookup in that same cycle.
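If you want to try that, here is a quick sketch of fitting a low-order polynomial to the power function (the exponent 1.3 is just the example value from your mail):

    import numpy as np

    p = 1.3
    x = np.linspace(0.0, 1.0, 1001)         # normalized magnitudes live in [0, 1]
    target = x ** p
    coeffs = np.polyfit(x, target, deg=3)   # cubic: evaluation is multiply-adds only
    max_err = np.max(np.abs(np.polyval(coeffs, x) - target))
    print(coeffs, max_err)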
> My question here is whether you often have the low-K case: K < L/2?
It's really all over the place: sometimes you have pretty large Ks, but small Ks are also very common. One of the most common is actually K=1, which I *believe* is actually optimal with the normal codebook, even if K>1 isn't. Of course, the most common of all is K=0, because the only way to compress video efficiently is to not code anything at all for as many blocks as possible.
Again, be careful about the distribution. Using Gaussian distributions
is nice for developing theoretical calculations and makes for nice papers, but it doesn't give you a better video codec.