Hello,
There is a new approach to entropy coding (no patents): Asymmetric Numeral Systems (ANS). The current implementation decodes about 50% faster than Huffman (a single table lookup and a few binary operations per symbol from a large alphabet), while giving compression rates like arithmetic coding: https://github.com/Cyan4973/FiniteStateEntropy
The idea is somewhat similar to arithmetic coding, but instead of two numbers defining the current state (a range), it uses a single natural number (containing lg(state) bits of information). Thanks to this much smaller space of states, the entire behavior can be put into a small table. Here is a poster explaining the basic concepts and the difference from arithmetic coding: https://dl.dropboxusercontent.com/u/12405967/poster.pdf
I would like to propose a discussion about the possibility of applying it in video compression like VP9 - it should be more than 10 times faster than arithmetic coding while providing a similar compression rate. The cost is that we need to store coding tables for a given probability distribution: say 1-16 kB for a 256-symbol alphabet. For different contexts (probability distributions) we can build separate tables and switch between them - the memory cost is "the number of different contexts/distributions" times a few kilobytes. So if the number of different contexts used for given coding parameters is smaller than, say, a thousand, ANS can provide a huge speedup.
I have got two comments regarding memory cost here.
Currently 1 GB of SDRAM costs about $10, so a few hundred kB would cost well below one cent - especially since small memories can use smaller addressing, making them cheaper and faster. A specialized memory could use less than 16-bit addressing here.
The gain for encoding e.g. a 256-level quantized DCT coefficient would be a single cheap ANS step instead of 8 slower AC steps (with renormalization) - about 10 times faster, which for a hardware implementation means much lower clock frequencies and energy usage (smartphones...).
The minimal memory cost for a 256-symbol alphabet is 1 kB (assuming no symbol has probability above 1/4; 1.25 kB in general), working about 0.001 from the theoretical rate limit with coding tables optimized for the given probability distribution (as part of the coder) - using a few dozen quantization types (probability distributions), the memory cost would be only a few dozen kilobytes.
Thanks to much simpler renormalization than in arithmetic coding, replacing it with binary ANS (e.g. qABS) should also be a speedup - this time, working about 0.001 from the theoretical rate limit would cost below 1 kB of tables in total.
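For illustration, here is a minimal sketch of a single uABS (binary ANS) step with the probability of symbol 1 equal to p = p0/2^16 - renormalization omitted and helper names purely illustrative:

#include <stdint.h>

#define PROB_BITS 16                   // p = p0 / 2^PROB_BITS

// encode symbol s (0 or 1) into state x
static inline uint64_t uabs_encode(uint64_t x, int s, uint32_t p0) {
    const uint64_t M = 1u << PROB_BITS;
    if (s) return (x * M) / p0;                           // floor(x / p)
    return ((x + 1) * M + (M - p0) - 1) / (M - p0) - 1;   // ceil((x+1)/(1-p)) - 1
}

// decode one symbol from state *x and update the state
static inline int uabs_decode(uint64_t *x, uint32_t p0) {
    const uint64_t M  = 1u << PROB_BITS;
    uint64_t xp  = (*x * p0 + M - 1) >> PROB_BITS;        // ceil(x * p)
    uint64_t xp1 = ((*x + 1) * p0 + M - 1) >> PROB_BITS;  // ceil((x+1) * p)
    int s = (int)(xp1 - xp);                              // 0 or 1
    *x = s ? xp : *x - xp;
    return s;
}

A real coder would add a renormalization on top of this so that x stays within a fixed interval.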
Thanks, Jarek
A few hundred kilobytes is actually quite a bit of silicon. Our implementation of the entire VP9 hardware decoder for 4K 60 fps currently needs about 240 kB of SRAM for all the data it needs to store (including line buffers for motion vectors, pixels for in-loop filtering and intra prediction, etc.). It sounds like the ANS solution could more than double that, so from that perspective it is an expensive solution.
Thanks,
Aki Kuusela
On Tuesday, January 7, 2014 6:56:50 PM UTC+2, dud...@gmail.com wrote:
Hello, thanks for the reply.
I am rather a theory person and video compression is something new for me - I just wanted to discuss its possible applicability.
Indeed the memory need could be an issue, but it is just a few kilobytes per table/context, so all contexts together would need a few hundred kilobytes - even for a hardware implementation, the cost of this amount of memory is a small fraction of a cent.
Especially looking at the advantages - a large symbol processed in a single table use - here is an example of the whole step for processing one symbol (including renormalization):
step = decodeTable[state];                                 // single table lookup
symbol = step.symbol;                                      // decoded symbol
state = step.newState | (bitStream & mask[step.nbBits]);   // refill the state with fresh bits
bitStream = bitStream >> step.nbBits;
where mask[b] = (1 << b) - 1
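For concreteness, a self-contained (and purely illustrative) version of this step, with the table entry spelled out as a struct - field names follow the snippet above, sizes are just one possible choice:

#include <stdint.h>

typedef struct {
    uint8_t  symbol;    // decoded symbol
    uint8_t  nbBits;    // number of fresh bits to pull from the stream
    uint16_t newState;  // next state before the fresh bits are ORed in
} DecodeStep;

// Decode n symbols; for simplicity bitStream is assumed to already hold
// enough bits (a real decoder would refill it periodically).
void decode(const DecodeStep *decodeTable, uint32_t state,
            uint64_t bitStream, uint8_t *out, int n) {
    for (int i = 0; i < n; i++) {
        DecodeStep step = decodeTable[state];
        out[i] = step.symbol;
        uint32_t mask = (1u << step.nbBits) - 1;             // mask[b] = (1<<b)-1
        state = step.newState | ((uint32_t)bitStream & mask);
        bitStream >>= step.nbBits;
    }
}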
I imagine that video compression involves lots of binary adaptive choices - where we should work bit by bit - but there is also lots of encoding of DCT coefficients as large, independent numbers with just a few different ranges/contexts (?). If so, ANS would be perfect for such static distributions on large alphabets.
However, there are also many different ways to use this new approach (see the table on page 4 of the new version of http://arxiv.org/pdf/1311.2540 ) - for example, some of these tables (say the first 32 or 64) could be for quantized probabilities of a binary alphabet - we can freely switch between tables symbol by symbol.
In the binary case, whether tabled or using a formula with a single multiplication per symbol, ANS should still have an advantage over AC - as you can see, it has much simpler renormalization: instead of branching bit by bit, with the issue when the range contains the middle point, in ANS we can just transfer the required number of bits once per symbol.
--
Hi skal, all,
First of all, indeed I was told that DeltaH is about 20% larger in experiments.
However, as I write in the paper, I calculate it purely analytically: assume an i.i.d. input source, find the stationary probability distribution of the used states (dominant eigenvector), which allows finding the expected number of bits used per symbol, then subtract the Shannon entropy.
Here is the source of the Mathematica procedure I was using: https://dl.dropboxusercontent.com/u/12405967/ans.nb
I interpret the difference as mainly caused by an imperfect source.
Another issue is spreading the symbols - the better it is done, the smaller deltaH.
The github implementation uses an approximate symbol-distributing algorithm - the tests in the paper were made with precise initialization (using the procedure above).
What symbol distribution are you using?
Tests with precise initialization for a 256-symbol alphabet and probabilities approximated as l_s/512 gave deltaH < 0.01 bits/symbol for 512 states, which requires a 512 * (1 byte for symbol + 1 byte for new state) = 1 kB table.
A symbol here carries almost 8 bits, so this loss means about a 0.00125 relative rate loss.
And for a fixed stationary probability distribution, by shifting symbol appearances we can "tune" the table to be optimal for a different probability distribution - such an optimized spread can be part of the coder.
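For reference, a minimal sketch of one way to realize the precise initialization mentioned above (greedy rule: the i-th appearance of symbol s prefers position (i + 0.5)/p[s]; array names are illustrative, nbSymbols <= 256 assumed):

#include <stdint.h>

// Greedy "precise" symbol spread, O(L * nbSymbols), for illustration only:
// at every table slot place the symbol whose next preferred position is smallest.
void spread_precise(int nbSymbols, const double *p, int L, uint8_t *symbolOfSlot) {
    double next[256];                              // next preferred position per symbol
    for (int s = 0; s < nbSymbols; s++) next[s] = 0.5 / p[s];
    for (int x = 0; x < L; x++) {
        int best = 0;
        for (int s = 1; s < nbSymbols; s++)
            if (next[s] < next[best]) best = s;
        symbolOfSlot[x] = (uint8_t)best;           // state L + x will decode to 'best'
        next[best] += 1.0 / p[best];
    }
}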
Thanks,
Jarek
Hi skal, all,
Great thanks for the tests!
It is really surprising that all these complex arithmetic coding renormalizations can be made faster than a few fixed binary operations ... but these are just the first ANS implementations, so maybe there is still room for improvement...
The dependence of speed on probability will be larger when using direct ABS formulas and a large b to gather an accumulated byte or two at once, as in Matt Mahoney's fpaqc.
The large alphabet version is perfect for large stationary probability distributions, like DCT coefficients.
However, we can also use its huge speed advantage in the binary case, when probabilities change from bit to bit - by grouping symbols of the same quantized probability.
Say we use 64 different quantizations of the probability of the least probable symbol (1/128 precision) - then we can use 64 8-bit buffers, add binary choices bit by bit to the corresponding buffers, and apply one of 64 ANS coding tables for a 256-symbol alphabet every time a buffer accumulates 8 binary symbols.
All these 64 tables can work on the same state - the decoder knows which probability's buffer runs out.
About the required table size: I see you have indeed used the github version - I have just tested it and it gives a few to hundreds of times larger deltaH (found analytically) than precise initialization, and is generally very capricious.
It was made to be fast, but with a better symbol spread, a few times smaller coding tables would suffice for the same accuracy.
Here is a sketch of a fast approximation of the precise symbol spread (I am not good at C), bucketing the search for the most appropriate symbol (it can be improved by increasing the number of buckets) - it should need much smaller tables for the same accuracy:
int first[nbStates];     // first symbol in given bucket (stack head), or -1 if empty
int next[nbSymbols];     // next symbol in the same stack
float pos[nbSymbols];    // expected symbol position (could be U16 instead)

for (i = 0; i < nbStates; i++) first[i] = -1;          // empty buckets
for (s = 0; s < nbSymbols; s++) {
    pos[s] = ip[s] / 2;                                // ip[s] = 1 / probability[s]
    t = round(ip[s] / 2);                              // which bucket
    next[s] = first[t]; first[t] = s;                  // push to stack t
}

cp = 0;                                                // position in decoding table
for (i = 0; i < nbStates; i++)
    while ((s = first[i]) >= 0) {
        first[i] = next[s];                            // pop from stack i
        t = round(pos[s] += ip[s]);                    // new expected position
        if (t >= nbStates) t = nbStates - 1;           // clamp late appearances to the last bucket
        next[s] = first[t]; first[t] = s;              // push into the new bucket
        tableDecode[cp++].symbol = s;                  // assign the next decoding-table slot
    }
Hi skal, all,
Thank you for the tests. It looks really bad, so I have just checked precise initialization for the worst case in your graph: p = 0.1.
This time, instead of finding the stationary probability distribution as before, I just applied the encoder to 10^6-symbol random sequences: https://dl.dropboxusercontent.com/u/12405967/ans1.nb
So for 2^12 states, ANS needs ~1.01 of what Shannon says (sum of lg(1/p) over symbols); as h ~ 3.752 bits/symbol for this binomial distribution, it means deltaH ~ 0.04 bits/symbol.
For 2^13 states, it needs ~1.004 of what Shannon says, which means deltaH ~ 0.015 bits/symbol.
For 2^14 states, it needs ~1.00165 of what Shannon says: deltaH ~ 0.0062 bits/symbol.
p (Proba) | h FSC     | h FSC8    | h BC      | h theory
0.1019608 | 0.4767120 | 0.4760080 | 0.4750568 | 0.4747218
which is deltaH ~= .002
The last value is about half of the one in your graph for p = 0.1 - so you could improve it roughly twofold by not approximating the precise initialization (my approximation is a bit better, as it does not round the expected positions).
There is also the question of how to approximate the probabilities with fractions whose denominator is the number of states.
We could probably improve it further by tuning - for example by shifting extremely low-probability symbols toward the end.
Still, these values are much worse than in the paper - mainly because, as I have emphasized, I did not consider approximating the probabilities with "number of states" denominators, which is the real problem here: lots of symbols have much smaller probability.
Fabian Giesen's paper with practical remarks on ANS implementation (especially formula variants) has just appeared: http://arxiv.org/abs/1402.3392
--
Hi skal, all,
I have recently found a fast "tuned" symbol spread for tANS while working on a toolkit for testing various quantization and spread methods ( https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit ): it uses not only the q[s] ~ L*p[s] quantization of the probability distribution (numbers of appearances), but also the exact values of the probabilities p[s].
For example, for 4 symbols and L = 16 states:
p: 0.04 0.16 0.16 0.64
q/L: 0.0625 0.1875 0.125 0.625
this quantization itself gives a dH/H ~ 0.011 rate penalty;
spread_fast() gives 0233233133133133 and dH/H ~ 0.020,
spread_prec() gives 3313233103332133 and dH/H ~ 0.015 - generally it is close to the quantization dH/H,
while spread_tuned() gives 3233321333313310 and dH/H ~ 0.0046 - better than the quantization dH/H thanks to also using p.
Tuning shifts symbols right when q[s]/L > p[s] and left otherwise, getting better agreement and thus a better compression rate.
The rule found is: the preferred position for the i-th appearance of symbol s, i \in {q[s], ..., 2q[s]-1}, is x = 1/(p[s] * ln(1+1/i)).
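A sketch of one way to turn this rule into a spread (compute all preferred positions, sort them, and fill the table in that order; the toolkit's spread_tuned() may differ in details):

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { double pos; int symbol; } Pref;

static int cmp_pref(const void *a, const void *b) {
    double d = ((const Pref *)a)->pos - ((const Pref *)b)->pos;
    return (d > 0) - (d < 0);
}

// q[s] appearances of symbol s, indexed i = q[s] .. 2q[s]-1, each preferring
// position 1/(p[s]*ln(1+1/i)); the sum of q[s] over s must equal L.
void spread_tuned_sketch(int nbSymbols, const double *p, const int *q,
                         int L, uint8_t *symbolOfSlot) {
    Pref *prefs = malloc(sizeof(Pref) * L);
    int n = 0;
    for (int s = 0; s < nbSymbols; s++)
        for (int i = q[s]; i < 2 * q[s]; i++) {
            prefs[n].pos = 1.0 / (p[s] * log1p(1.0 / i));
            prefs[n].symbol = s;
            n++;
        }
    qsort(prefs, L, sizeof(Pref), cmp_pref);
    for (int x = 0; x < L; x++) symbolOfSlot[x] = (uint8_t)prefs[x].symbol;
    free(prefs);
}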
Returning to video compression, I have a problem finding VP9 documentation, but here ( http://forum.doom9.org/showthread.php?t=168947 ) I have found: "Probabilities are one byte each, and there are 1783 of them for keyframes, 1902 for inter frames (by my count)".
So you are probably using a 256-symbol-alphabet range coder with a lot of fixed probabilities - rANS is a many times faster direct alternative ( https://github.com/rygorous/ryg_rans ); see the decode-step sketch after this list:
- it uses single multiplication per decoding step instead of 2 for range coding,
- its state is a single natural number, making it perfect for SIMD parallelization,
- it uses a simpler representation of the probability distribution, for example allowing the alias method to reduce memory needs ( http://fgiesen.wordpress.com/2014/02/18/rans-with-static-probability-distributions/ ).
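For illustration, the core of such a decode step in the style of ryg_rans (a sketch; symbol_from_slot() stands for whatever slot-to-symbol lookup is used - a plain table, binary search, or the alias method):

#include <stdint.h>

#define PROB_BITS 14              // example precision
#define RANS_L    (1u << 23)      // lower bound of the normalized state interval

extern int symbol_from_slot(uint32_t slot);   // hypothetical slot->symbol lookup

static inline int rans_decode_step(uint32_t *x, const uint8_t **pptr,
                                   const uint16_t *cum, const uint16_t *freq) {
    uint32_t slot = *x & ((1u << PROB_BITS) - 1);
    int s = symbol_from_slot(slot);
    *x = freq[s] * (*x >> PROB_BITS) + slot - cum[s];    // single multiplication
    while (*x < RANS_L) *x = (*x << 8) | *(*pptr)++;     // byte-wise renormalization
    return s;
}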
tANS is often even faster (FSE has ~50% faster decoding than zlib's Huffman). Thanks to tuning, I believe L = 1024 states should be sufficient for most situations (possibly grouping low-probability symbols, below 0.7/L, into an exit symbol decoded in a succeeding step) - you need 1 kB to store such a symbol spread and can nearly instantly generate a 3-4 kB decoding table from it.
--
This thread needs some updates:
- Pascal's implementation discussed above (which recently got a speedup thanks to interleaving): https://github.com/skal65535/fsc
- a discussion of binary cases, where ANS also turns out to provide a speedup: http://encode.ru/threads/1821-Asymetric-Numeral-System/page8
Here are the decoding steps from Pascal's listing, for a 32-bit x, 16-bit renormalization and 16-bit probability precision, p0 = prob*2^16 \in {1, ..., 2^16 - 1}, for two variants.
rABS variant:
uint16_t xfrac = x & 0xffff;
uint32_t x_new = p0 * (x >> 16);
if (xfrac < p0) { x = x_new + xfrac; out[i] = 0;
} else { x -= x_new + p0; out[i] = 1; }
uABS variant (very similar to Matt Mahoney's fpaqc):
uint64_t xp = (uint64_t)x * p0 - 1;
out[i] = ((xp & 0xffff) + p0) >> 16; // <= frac(xp) < p0
xp = (xp >> 16) + 1;
x = out[i] ? xp : x - xp;
or, written with generic constants (K = 2^L_pb):
uint64_t xp = (uint64_t)x * q0;
out[i] = ((xp & (K - 1)) >= p0);
xp >>= L_pb;
x = out[i] ? xp : x - xp;
plus renormalization (much simpler than in arithmetic coding):
if (x < (1 << 16)) { x = (x << 16) | *ptr++; }
- list of available implementations of ANS: http://encode.ru/threads/2078-List-of-Asymmetric-Numeral-Systems-implementations
- another compressor has switched to ANS: LZA, moving to the top of this benchmark: http://heartofcomp.altervista.org/MOC/MOCADE.htm (zhuff = LZ4 + tANS is at the top for decoding speed there).
--
It has just turned out that somebody is trying to patent something that looks very general about ANS:
"System and method for compressing data using asymmetric numeral systems with probability distributions"
https://www.ipo.gov.uk/p-ipsum/Case/ApplicationNumber/GB1502286.6
I wanted to make sure we cover all possibilities to prevent patenting, but I cannot even find its text.
As it could complicate the use of ANS, I thought it would also be in Google's interest to prevent that ...
ps. Yann Collet's very interesting ANS-based compressor: https://github.com/Cyan4973/zstd , https://www.reddit.com/r/programming/comments/2tibrh/zstd_a_new_compression_algorithm/
ps2. ANS will be presented at PCS 2015 in two weeks - the paper (nothing new, but written to be understood by reviewers): https://dl.dropboxusercontent.com/u/12405967/pcs2015.pdf
--
Hi Alex,
I wanted to take a closer look at VP9/10 to think about optimizations, but I cannot find documentation. The best sources I could find are:
http://files.meetup.com/9842252/Overview-VP9.pdf
http://forum.doom9.org/showthread.php?t=168947
Are there better ones available? Could you summarize the changes made/planned for VP10?
The main advantage of ANS is accurate handling of large alphabets - there are many places it could apply, including the compressed header, prediction modes, motion vectors, etc. ... however, the most significant are the DCT coefficients - that is indeed a natural first step, but eventually it might be beneficial to change completely to switching between ANS and ABS (probably rANS and uABS, but it can also be realized with the tabled variants - without multiplications).
From the decoder point of view, which is the main priority in YouTube-like applications, such a change means a performance improvement (including uABS vs AC) and a simplification (no binarization) - I don't see any additional issues with parallelization here?
From the encoder point of view, an additional buffer for probabilities would be needed if probabilities were modified adaptively. However, at least in VP9, the probability model seems to be fixed within a frame (still true in VP10?) - eventual adaptation is made for the successive frame. In that case we have just context-based modelling (as in a Markov model) and no additional buffer is needed: the encoder goes backward, but uses the context from the perspective of the later forward decoding. In current (software) implementations of ANS-based compressors, entropy coding is usually split from the modelling phase to get the boost from interleaving. In any case, I don't see additional issues regarding parallelization here?
See Fabian's post about dynamically switching between various streams: https://fgiesen.wordpress.com/2015/12/21/rans-in-practice/
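To illustrate why the backward encoding does not seem a big complication: a sketch of the two-pass scheme under the assumptions above (names are hypothetical; the rANS encode step follows the ryg_rans scheme):

#include <stdint.h>

#define PROB_BITS 14
#define RANS_L    (1u << 23)

typedef struct { uint16_t start, freq; } Slot;   // the (cdf, freq) pair the decoder will use

// rANS encode step; bytes are emitted backwards through *pptr.
static inline void rans_encode_step(uint32_t *x, uint8_t **pptr, Slot sl) {
    uint32_t x_max = ((RANS_L >> PROB_BITS) << 8) * sl.freq;
    while (*x >= x_max) { *--(*pptr) = (uint8_t)*x; *x >>= 8; }
    *x = ((*x / sl.freq) << PROB_BITS) + (*x % sl.freq) + sl.start;
}

// Two-pass, context-modelled encoding of one block:
//  1) a forward pass runs the same context model the decoder will run and
//     records, for each symbol, the (start, freq) slot it will be decoded with;
//  2) a backward pass feeds those slots to the encoder in reverse order,
//     so the decoder consumes them forward.
void encode_block(const Slot *slots, int n, uint8_t *buf_end, uint8_t **out_begin) {
    uint32_t x = RANS_L;
    uint8_t *ptr = buf_end;                       // caller provides enough space before buf_end
    for (int i = n - 1; i >= 0; i--)
        rans_encode_step(&x, &ptr, slots[i]);
    ptr -= 4;                                     // flush the final state (little-endian);
    ptr[0] = (uint8_t)(x >>  0);                  // the decoder reads it first
    ptr[1] = (uint8_t)(x >>  8);
    ptr[2] = (uint8_t)(x >> 16);
    ptr[3] = (uint8_t)(x >> 24);
    *out_begin = ptr;
}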
Let's look at the DCT/ADST: if I understand properly, you use a symmetric Pareto probability distribution:
cdf(x) = 0.5 + 0.5 * sgn(x) * [1 - (alpha/(alpha + |x|))^beta]
for beta = 8 and alpha determined by the context - one of 576 combinations of {Y or UV, block size, position in block, above/left coefficient} - using 8-bit probabilities, originally for the branches of a binary tree.
Using a large-alphabet entropy coder, you can have coding tables prepared for quantized alpha and choose one according to the context - it seems this is what you are doing (?) - rANS is also perfect here.
However, it seems you have kept 8-bit probabilities - a large alphabet might need better resolution here, and there is no problem with that when using rANS. I see you use a 32-bit state and 8-bit renormalization - there is no problem with using even 16-bit probabilities in this setting.
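As an illustration of what prepared tables for quantized alpha might look like with 16-bit precision, a sketch quantizing this cdf into a cumulative-frequency table (the names, the symmetric token range -N..N and the naive zero-frequency fixup are all simplifications):

#include <math.h>
#include <stdint.h>

#define PROB_BITS  16
#define PROB_SCALE (1u << PROB_BITS)

static double pareto_cdf(double x, double alpha, double beta) {
    double t = 1.0 - pow(alpha / (alpha + fabs(x)), beta);
    return 0.5 + 0.5 * (x < 0.0 ? -t : t);
}

// cum[] receives 2N+2 entries: cum[0] = 0, ..., cum[2N+1] = PROB_SCALE,
// so token k in [-N, N] gets frequency cum[k+N+1] - cum[k+N].
void build_cdf_table(uint32_t *cum, int N, double alpha, double beta) {
    int n = 2 * N + 1;
    cum[0] = 0;
    for (int k = -N; k < N; k++)
        cum[k + N + 1] = (uint32_t)(pareto_cdf(k + 0.5, alpha, beta) * PROB_SCALE + 0.5);
    cum[n] = PROB_SCALE;
    for (int i = 1; i < n; i++)             // naive fixup: keep every frequency nonzero
        if (cum[i] <= cum[i - 1])           // (a real implementation would redistribute
            cum[i] = cum[i - 1] + 1;        //  the excess instead)
}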
There are probably further optimizations possible here - tough questions like:
- What family of probability distributions to choose (I thought a two-sided exponential, rho(x) ~ exp(-alpha|x|), is usually used), and which context dependencies ...
- The context of neighboring coefficients seems extremely important and hard to use effectively - maybe it should depend e.g. on a (weighted) average of all previous coefficients ...
- If high frequencies take only e.g. {-1,0,1} values, you can group a few of them into a larger alphabet,
- The EOB token seems a burden - wouldn't it be better to just encode the number of nonzero coefficients? ...
"Q8 test file, order 0:
arith_static 239.1 MB/s enc, 145.4 MB/s dec 73124567 bytes -> 16854053 bytes
rANS_static4c 291.7 MB/s enc, 349.5 MB/s dec 73124567 bytes -> 16847633 bytes
rANS_static4j 290.5 MB/s enc, 354.2 MB/s dec 73124567 bytes -> 16847597 bytes
rANS_static4k 287.5 MB/s enc, 666.2 MB/s dec 73124567 bytes -> 16847597 bytes
rANS_static4_16i 371.0 MB/s enc, 453.6 MB/s dec 73124567 bytes -> 16847528 bytes
rANS_static64c 351.0 MB/s enc, 549.4 MB/s dec 73124567 bytes -> 16848348 bytes"
"
Q8 test file, order 1:
arith_static 189.2 MB/s enc, 130.4 MB/s dec 73124567 bytes -> 15860154 bytes
rANS_static4c 208.6 MB/s enc, 269.2 MB/s dec 73124567 bytes -> 15849814 bytes
rANS_static4j 210.5 MB/s enc, 305.1 MB/s dec 73124567 bytes -> 15849814 bytes
rANS_static4k 212.2 MB/s enc, 403.8 MB/s dec 73124567 bytes -> 15849814 bytes
rANS_static4_16i 240.4 MB/s enc, 345.4 MB/s dec 73124567 bytes -> 15869029 bytes
rANS_static64c 239.2 MB/s enc, 397.6 MB/s dec 73124567 bytes -> 15850522 bytes"
--
Format          JPEG    PNG     DXT1   Crunch  GST
Time (ms)       848.6   1190.2  85.8   242.3   93.4
Disk size (MB)  6.46    58.7    16.8   8.50    8.91
CPU size (MB)   100     100     16.8   16.8    8.91
while (x < L) x = (x << 1) + "bit from stream";   // renormalization
The renormalization can be done directly by just looking at the number of leading zeros - especially convenient for a hardware implementation.
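A sketch of that, assuming a power-of-two number of states L = 2^R, x in [1, 2L) after the symbol step, and a hypothetical read_bits() helper (uses the GCC/Clang __builtin_clz):

#include <stdint.h>

extern uint32_t read_bits(int n);     // hypothetical: returns the next n bits of the stream

// For x in [2^k, 2^(k+1)) exactly R - k bits are missing, and
// R - k = __builtin_clz(x) - __builtin_clz(L), so all of them
// can be read at once instead of looping bit by bit.
static inline uint32_t renorm_clz(uint32_t x, uint32_t L) {
    int nbBits = __builtin_clz(x) - __builtin_clz(L);
    return (x << nbBits) | read_bits(nbBits);     // result lands in [L, 2L)
}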
coding step ( q[s] = CDF[s+1] - CDF[s] ):
while (x >= 2*q[s]) { produce(x & 1); x >>= 1; }   // renormalization
x = x + xs[s];
It is tANS with a trivial symbol spread - allowing the use of a simple formula instead of a table, at the cost of some small inaccuracy. In contrast to Huffman, it does not operate on power-of-2 probabilities: it has no problem with skewed distributions (Pr ~ 1).
If you want, I can compare it with the much more complex Moffat/Daala approximation, or you can just test it yourself using spread_range in https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit
So it can be much better than Moffat without multiplication, still operating directly on the CDF.
I have just spotted week-old Daala slides ( https://people.xiph.org/~tterribe/pubs/lca2017/aom.pdf ) and couldn't resist commenting on their summary of AV1 entropy coding (slide 22):
“
– daala_ec : the current default
– ans : faster in software, complications for hardware/realtime (must encode in reverse order)
“
1) Surprisingly, it forgets to mention the real problem with the Daala EC - that it is INACCURATE.
Moffat's inaccuracy gives on average a 2.5% ratio loss; I understand it is reduced a bit by exploiting the tendency of the first symbol to be the most probable one.
However, we could use this tendency in a controlled way instead: e.g. in the initial probability distribution, which optimally should be chosen individually for each context or group of contexts (and e.g. fixed in the standard).
Then, after making the best of controlled prediction, an accurate entropy coder should be used - you could easily get a >1% ratio improvement here (I got a few percent improvement when testing a year ago: http://lists.xiph.org/pipermail/daala/2016-April/000138.html ).
And no, adaptiveness is not an obstruction to using rANS ( https://fgiesen.wordpress.com/2015/05/26/models-for-adaptive-arithmetic-coding/ ).
2) Regarding cost, my guess is that internet traffic is dominated by video streaming like YouTube and Netflix, so savings there should be the main priority: where you encode once and decode thousands to millions of times, decoding cost is much more important.
And honestly, the additional cost of rANS backward encoding seems completely negligible here (?): it requires just a few-kilobyte buffer for the encoder (which can be shared) to get << 0.1% ratio loss from data blocks fitting the buffer. The required tables of probability distributions for hundreds of contexts are already many times larger, and a video encoder seems to generally require megabytes of memory for all the analysis and processing (?).
As you plan for billions of AV1 users, it is only a matter of time before people start asking why, when we now have modern, fast and accurate entropy coding, widely used e.g. by Apple and Facebook, the Google et al. team decided to use an old, slow and inaccurate one instead.
Leading to comments like "Although brotli uses a less efficient entropy encoder than Zstandard, it is already implemented (...)" from the Guardian ( https://www.theguardian.com/info/developer-blog/2016/dec/01/discover-new-compression-iinovations-brotli-and-zstandard ).
Maybe let's spare ourselves the embarrassment and start this inevitable discussion now - is there a real, objective reason for using the slow and inaccurate Moffat approximation in AV1?
https://register.epo.org/ipfwretrieve?apn=US.201615041228.A&lng=en
… approaching a lawyer's paradise, where everybody has an ANS patent and nobody can safely use ANS.
It is not that I am completely against software patents, only against the pathological, obvious ones - whose only purpose is to fight competition and create a legal minefield, usually from both sides.
Instead of further inflaming this pathological situation, please let us try to end this madness of ridiculous patents and keep the basic application of ANS legally unrestricted. Peace.