
Asymmetric numeral systems as close-to-capacity low-state entropy coders


Jarek Duda

Nov 12, 2013, 1:45:38 PM
I have finished a new paper about asymmetric numeral systems (to finally send it to a journal) - completely rewritten, with lots of examples, simulations, and some substantive improvements.
It is written from the perspective of constructing a low-state entropy coding automaton operating very close to the Shannon limit - we usually need fewer than 20 states to get below 0.001 bits/symbol loss for a small alphabet (practically unreachable for Huffman coding).

http://arxiv.org/pdf/1311.2540

Has anybody seen a theory of entropy coding finite state automata? It is interesting to see how non-integer numbers of bits flow inside - a state of a given stationary probability contains "minus log of this probability" bits of information.
I would be grateful for any comments.

Thomas Richter

Nov 16, 2013, 4:36:57 PM
Jarek,

the very first thing (just when reading the abstract) is that this is
actually pretty much the idea behind CABAC (the MPEG encoder). It is not
exactly an arithmetic coder, but instead a state machine driven by
incoming symbols close to what you describe. I suggest you take a
serious look at this work since it seems pretty similar.

Try to look into the following paper:

Marpe, D.; Schwarz, H.; Wiegand, T., "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620-636, 2003. DOI: 10.1109/TCSVT.2003.815173

Hope this helps,

Thomas

Jarek Duda

Nov 17, 2013, 9:51:27 PM
Hi Thomas,
Thanks - indeed it turns out that there are practical ways of transforming arithmetic coding into a low-state automaton, like fast/quasi-arithmetic coding - I will have to compare with them.
If you want, there is a large discussion about asymmetric numeral systems, including some new simulations: http://encode.ru/threads/1821-Asymetric-Numeral-System
Best,
Jarek

Jarek Duda

Nov 26, 2013, 12:23:32 PM
I have just realized that for simple fractions we can replace the largest cost - multiplication - with a few bit shifts and additions.
For example, for a (3,5)/8 probability distribution, let us choose the
(00011111)
cyclic symbol distribution for all natural numbers: s=0 if x%8<3.

Now encoding is
C(0,x)=8*floor(x/3) + (x%3)
C(1,x)=8*floor(x/5) + (x%5) + 3

decoding:
if (x%8 < 3)
{s=0; x = 3*floor(x/8) + (x%8);}
else
{s=1; x = 5*floor(x/8) + (x%8) - 3;}

where we can use the cheaper:
floor(x/8) = x>>3
x%8 = x & 7
y*8 = y<<3
y*3 = y + (y<<1)
y*5 = y + (y<<2)
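
As a quick sanity check, here is a minimal C sketch of this example (my own illustration, not from the paper; no renormalization, so the state simply grows):

#include <assert.h>
#include <stdio.h>

int C(int s, int x) {                      /* encode symbol s from state x */
    return s == 0 ? ((x / 3) << 3) + (x % 3)
                  : ((x / 5) << 3) + (x % 5) + 3;
}

int D(int x, int *s) {                     /* decode: recover symbol and previous state */
    if ((x & 7) < 3) { *s = 0; return 3 * (x >> 3) + (x & 7); }
    else             { *s = 1; return 5 * (x >> 3) + (x & 7) - 3; }
}

int main(void) {
    for (int x = 0; x < 1000; x++)
        for (int s = 0; s < 2; s++) {
            int s2, x2 = D(C(s, x), &s2);
            assert(s2 == s && x2 == x);    /* D(C(s,x)) == (s,x) */
        }
    printf("roundtrip ok\n");
    return 0;
}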

Another new possibility for ANS applications is covering all probabilities with a few automata having the same number of states (for a hardware coder, or when using lookup tables) - for example, 5 automata with 5 states are needed for approximately 0.01 bits/symbol loss, and about 20 automata with 16 states for approximately 0.001 bits/symbol loss.
Some details can be found in the encode.ru forum; here is a poster gathering the basic information: https://dl.dropboxusercontent.com/u/12405967/poster.pdf

Jarek Duda

Dec 30, 2013, 8:20:29 PM
I have realized that the method from my previous post can be generalized to larger alphabets - getting an alternative to range coding, but using only a single multiplication by a small natural number per symbol - it should be faster:

Let count[i]/m be the probability of symbol i, where m = sum_i count[i]; in practice m should preferably be a power of 2: m = 2^M.
Now choose the symbol distribution with period m: (1...1 2...2 ... n...n), with count[i] repetitions of symbol i.

The decoding is:
s = sym[state % m];
state = count[s] * (state >> M) + (state % m) - beginning[s];

where sym[] is the corresponding symbol in (1...1 2...2 ... n...n) and beginning[] is its starting position there.
state % m = state & ((1<<M)-1)

Encoding:
C(s,state) = ((state / count[s]) << M) + beginning[s] + (state % count[s])
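
Putting the above together as a compact C sketch (the table names are hypothetical; renormalization and stream I/O are omitted):

#include <stdint.h>

extern const uint8_t  sym[];        /* symbol owning each position of the period */
extern const uint32_t count[];      /* occurrences of each symbol, sum = m = 1<<M */
extern const uint32_t beginning[];  /* beginning[s] = count[0]+...+count[s-1] */
extern const uint32_t M;            /* m = 1 << M */

uint32_t decode_step(uint32_t state, uint8_t *s_out) {
    uint32_t r = state & ((1u << M) - 1);   /* state % m */
    uint8_t  s = sym[r];
    *s_out = s;
    return count[s] * (state >> M) + r - beginning[s];
}

uint32_t encode_step(uint8_t s, uint32_t state) {
    return ((state / count[s]) << M) + beginning[s] + (state % count[s]);
}

Note that decoding needs only one multiplication by the small number count[s], while encoding pays with one division by count[s].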

So while tabled ANS was focused on speed, this version needs a few times less memory, can have better accuracy, and is much cheaper to modify on the fly ... at the cost of being a few times slower.

-----------------------------------

Another update is a new implementation of tabled ANS.
Andrew Polar's version has been available for a few years ( http://ezcodesample.com/abs/ANSCoder.txt ), but the recent Yann Collet implementation is finally focused on the big ANS advantage: speed

https://github.com/Cyan4973/FiniteStateEntropy

THIS IMPLEMENTATION HAS ABOUT 50% FASTER DECODING THAN HUFFMAN, WHILE STILL PROVIDING A COMPRESSION RATE LIKE ARITHMETIC CODING.
And so, in most cases where Huffman is currently used, we can both speed up and improve the rate by just replacing it with ANS. The first example is Yann's Zhuff: http://fastcompression.blogspot.fr/2013/12/zhuff-v09-or-first-fse-application.html

This amazing speed is not a surprise, as a single decoding step can be just:

nbBits = decodeTable[state].nbBits;   // how many fresh bits this step consumes
symbol = decodeTable[state].symbol;   // symbol emitted in this state
state = decodeTable[state].newState | (bitStream & mask[nbBits]);  // append fresh bits
bitStream = bitStream >> nbBits;

(mask[nbBits] = (1<<nbBits)-1)
so just a single table access and a few binary operations per symbol - no multiplications or branching. It is also perfect for hardware implementation.
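
For concreteness, a decode-table entry in such an implementation might look like the struct below - this is only a guess at the layout; the actual FiniteStateEntropy structures may differ:

#include <stdint.h>

typedef struct {
    uint16_t newState;  /* next state base, before the fresh bits are ORed in */
    uint8_t  symbol;    /* symbol emitted in this state */
    uint8_t  nbBits;    /* number of fresh bits consumed from the stream */
} DecodeEntry;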

Jarek Duda

Jan 16, 2014, 12:37:31 PM
Here is a large independent test of Yann's ANS implementation regarding its use in video compression: https://groups.google.com/a/webmproject.org/d/msg/codec-devel/idezdUoV1yY/T1cZQLnJrZ4J
I think video compression, like Google's VP codecs, is what will gain the most from ANS, as you want high-quality video decompression without e.g. killing a smartphone battery.
We can directly use it there for large static distributions like DCT coefficients ...

... but I have recently realized that we can use its huge speedup (like 8x) also for varying binary choices: by grouping symbols of the same quantized probability into separate, e.g. 8-bit, buffers.
Let's say we use 64 different quantizations of the probability of the least probable symbol (1/128 precision) - so we can use 64 8-bit buffers, add binary choices bit by bit to the corresponding buffers, and use one of 64 ANS coding tables for a 256-size alphabet every time a buffer accumulates 8 binary symbols.
All these 64 tables can work on the same state - the decoder knows when symbols of a given probability have run out.

It seems there is an issue with encoding the remaining contents of the buffers when the block ends - we could do it using the direct ABS formulas, but we would still need to additionally store the number of remaining symbols in each of them, so that the decoder knows how many to load at the start - storing these numbers would require e.g. 3 bits * 64 = 24 bytes per block - not very much ...
... but it can be avoided, again at the cost of encoding time - this time let us use 64 large buffers, feed them with the succeeding binary symbols, and start using the 256-alphabet ANS only after processing the whole data block - this way the decoder can start by loading 8 binary choices into each of the 64 buffers (assume at least one load each, even if it will not be entirely used - the unused positions are filled with the more probable symbol).
There is a small complication at the end of the decoded block - the decoder has to detect that fewer than 8 symbols remain for a given probability, and switch to using the ABS formulas instead.

Most current arithmetic coding applications could get a huge speedup this way ...
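
A toy encoder-side sketch of the buffering idea (my own illustration; ans_encode_class() is a hypothetical stand-in for encoding one byte with the 256-alphabet tANS table of the given probability class):

#include <stdint.h>

extern void ans_encode_class(int cls, uint8_t byte);  /* hypothetical */

typedef struct { uint8_t bits; uint8_t count; } BitBuffer;
static BitBuffer buf[64];            /* one buffer per quantized probability */

void push_binary_choice(int cls, int bit) {
    buf[cls].bits = (uint8_t)((buf[cls].bits << 1) | (bit & 1));
    if (++buf[cls].count == 8) {     /* 8 binary choices accumulated */
        ans_encode_class(cls, buf[cls].bits);
        buf[cls].bits = 0;
        buf[cls].count = 0;
    }
}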

Jarek Duda

Mar 10, 2014, 9:00:05 PM
Here is a current benchmark of entropy coders (the ANS implementations are currently much faster than the Huffman ones, while being much more accurate): http://encode.ru/threads/1890-Benchmarking-Entropy-Coders

Fabian Giesen has shown that the range variant (rANS) can be even faster thanks to SIMD vectorization - unfortunately at the cost of accuracy.
Also very interesting is his use of the alias method here: http://fgiesen.wordpress.com/2014/02/18/rans-with-static-probability-distributions/
It allows working with great precision on huge alphabets (like 2^15, within L1 cache).
Additionally, it is cheap for dynamic updating (and accurate) - I think much better than Huffman.
For an m-size alphabet you have m buckets - each of them contains at most two symbols: the corresponding one and additionally a part of the probability of some different symbol (the "alias").
So a simple update of the probability of the m-th symbol can be done e.g. by just shifting the divider in the m-th bucket - it is at the cost of the probability of the alias symbol, but that is usually a much more probable symbol, so small changes to it have a tiny impact.
Shifting the divider requires updating the starts of further appearances of the symbol and the alias, but it is cheap (a few values in a list), and we can try to update buckets as far away as possible.
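
A minimal sketch of the alias-method lookup being described (my own reading of the construction; all names are made up). The total probability mass M is split into m equal buckets; each bucket holds its own symbol below a divider and the alias symbol above it:

#include <stdint.h>

typedef struct {
    uint32_t divider;   /* offsets below this belong to the bucket's own symbol */
    uint16_t alias;     /* symbol owning the rest of the bucket */
} AliasSlot;

/* m buckets of size M/m each; x_mod_M in [0, M) */
static inline uint16_t alias_symbol(const AliasSlot *slots, uint32_t m,
                                    uint32_t M, uint32_t x_mod_M) {
    uint32_t bucket = x_mod_M / (M / m);   /* which bucket */
    uint32_t offset = x_mod_M % (M / m);   /* position inside the bucket */
    return offset < slots[bucket].divider ? (uint16_t)bucket
                                          : slots[bucket].alias;
}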

Jarek Duda

Jul 26, 2014, 2:34:29 AM
Another benchmark of entropy coders (FSE is tANS): http://encode.ru/threads/1920-In-memory-open-source-benchmark-of-entropy-coders

The choice of a tANS L-state entropy coding automaton consists of:
- quantization of the symbol probability distribution as p[s] ~ q[s]/L fractions
- spreading these symbols in the range [0, L-1], such that s appears q[s] times
I have created a C++ toolkit to test different choices here and compare with Huffman, getting a few improvements on the way: https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit

The most important optimization is finally a fast "tuned" symbol spread.
Current implementations usually use only q for the symbol spread, slowly getting some basic tuning: shifting low-probability single-appearance symbols to the end of the range (recent FSE).
The new spread_tuned() uses all the symbol probabilities in addition to q. For example, for a 4-symbol alphabet and L=16 states:
p: 0.04 0.16 0.16 0.64
q/L: 0.0625 0.1875 0.125 0.625
spread_prec() gives 3313233103332133 and dH/H ~ 0.015,
while spread_tuned() gives 3233321333313310 and dH/H ~ 0.0046,
shifting symbols for which q[s]/L > p[s] to the right and the others to the left, getting better agreement and so a better compression rate.

The idea is pretty simple: we know that the probability distribution of states should be approximately P(x) \propto 1/x, so we choose the expected symbol appearances according to this rule. It is especially important if we switch between different coding tables, but it also gives improvements while using a single table.
P(x) \propto 1/x leads to a ~1/ln(2) normalization factor (harmonic numbers). Encoding to the i-th appearance of a symbol happens from a range of values: all x such that some number of x -> floor(x/2) steps leads to i \in [q[s], 2q[s]-1). The probability of such a range under our assumption turns out to be ~ ln(1+1/i)/ln(2). Multiplied by p[s], it is the probability of the obtained state x, which should be ~ 1/(x ln(2)), finally giving:

the preferred position for the i \in [q[s], 2q[s]-1) appearance of symbol s: x = 1/(p[s] * ln(1+1/i))
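
For instance, for the first symbol of the example above (p[0] = 0.04, q[0] = 1, L = 16), the single appearance has i = 1 and preferred position x = 1/(0.04 * ln 2) ~ 36, beyond the last state 31 - which is exactly why spread_tuned() pushes symbol 0 to the very end of the table.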

spread_tuned() creates lists of preferred symbols for each position, then spreads the symbols by reading these lists in order.
It has linear complexity, but might be too slow for some applications. For a speedup, only low-probability symbols can be treated this way (e.g. q=1, maybe 2), with a fast spread used for the remaining ones.
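
A rough sketch of this tuned spread under the P(x) \propto 1/x assumption (my own simplified version, not the toolkit's exact spread_tuned(); names are made up):

#include <math.h>
#include <stdlib.h>

typedef struct { double pos; int sym; } Pref;

static int cmp_pref(const void *a, const void *b) {
    double d = ((const Pref *)a)->pos - ((const Pref *)b)->pos;
    return (d > 0) - (d < 0);
}

/* p[s]: probabilities, q[s]: appearance counts with sum_s q[s] = L */
void spread_tuned_sketch(int nsym, const double *p, const int *q,
                         int L, int *table /* size L */) {
    Pref *prefs = malloc(L * sizeof(Pref));
    int n = 0;
    for (int s = 0; s < nsym; s++)
        for (int i = q[s]; i < 2 * q[s]; i++) {   /* i-th appearance of s */
            prefs[n].pos = 1.0 / (p[s] * log1p(1.0 / i)); /* preferred state */
            prefs[n].sym = s;
            n++;
        }
    qsort(prefs, n, sizeof(Pref), cmp_pref);      /* read positions in order */
    for (int i = 0; i < L; i++) table[i] = prefs[i].sym;
    free(prefs);
}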

Another new function is quantize_prec(), which finds the best quantization according to the sum_s (p[s] - q[s]/L)^2 / p[s] approximation of dH. Unfortunately it is O(n log n).
There is also quantize_fast(), which, after q[s] = round(p[s]*L), just uses the most probable symbol to repair the balance - it does not always work(!).
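
A guess at what quantize_fast() boils down to (with an extra >=1 guard that the toolkit version may handle differently):

/* q[s] = round(p[s]*L), then dump the rounding imbalance on the most probable
   symbol - as noted above, this does not always work */
void quantize_fast_sketch(int nsym, const double *p, int L, int *q) {
    int sum = 0, maxs = 0;
    for (int s = 0; s < nsym; s++) {
        q[s] = (int)(p[s] * L + 0.5);
        if (q[s] == 0) q[s] = 1;       /* every symbol needs at least one slot */
        if (p[s] > p[maxs]) maxs = s;
        sum += q[s];
    }
    q[maxs] += L - sum;                /* repair the balance */
    /* if q[maxs] <= 0 here, the quantization has failed */
}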
I know the toolkit crashes for nasty parameters - it is mostly caused by the process of finding the stationary probability distribution of states (to get hANS) failing to converge. It could be replaced by a direct method of finding the dominant eigenvector, but that would require storing the whole matrix - the current method uses linear memory even for large L and usually converges fast.

Feel free to add more probability distributions, better quantizers and spreads.

Ernst

Aug 3, 2014, 6:04:04 PM
I appreciate your post.

What caught my eye is how best to handle those singleton symbols, i.e. single-event symbols that screw up everything :)

I guess what I am asking is: what do you do to assign those low-probability events to some encoding?

Naturally I am still learning things, so I appreciate the conversation.


Jarek Duda

Aug 4, 2014, 1:42:58 AM
Low-probability symbols are a common problem, and grouping them into a so-called escape symbol, decoded in a succeeding step, is a standard solution. As, after grouping, they usually still have a relatively low total probability, this nearly doesn't affect speed (but it costs memory to define the coding for the escape symbol).

A low-probability symbol corresponds to a single appearance in the tANS symbol spread/distribution. Depending on the position of this appearance, it corresponds to from ~1.4/L probability (leftmost position) down to ~0.7/L (rightmost) - the lowest probabilities an L-state tANS can approximate.
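
A quick sanity check of these figures, using the P(x) \propto 1/x rule from earlier in this thread: a single appearance at state x is used with probability ~ ln(1+1/x)/ln(2), which for the leftmost state x = L gives ~ 1/(L ln 2) ~ 1.44/L, and for the rightmost x = 2L-1 gives ~ 1/(2L ln 2) ~ 0.72/L.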

Jarek Duda

Aug 8, 2014, 4:03:50 AM
Another compressor has replaced Huffman with tANS - getting improvements in both compression rate and speed:
http://encode.ru/threads/2017-LzTurbo-The-fastest-now-more-faster

Jarek Duda

Aug 24, 2014, 6:44:59 PM
And another variant - like rANS, but with the number of multiplications reduced to 0!
The idea comes from Yann's remark that we can see a Huffman decoder as a special case of a tANS decoder - when symbols are spread in ranges of size 2^k.
So what if we get rid of the "2^k" restriction and use the q_s ~ L*p_s quantization for the sizes of the symbol ranges instead - you can test it in the latest version of the toolkit.
Unfortunately dH does not decrease to 0 with L as previously - it stops at a slightly better rate than Huffman.

What is interesting about this spreading in ranges is that it allows simplifying the coding further - to a single addition plus renormalization.
In rANS we have used:
p_s ~ q_s/M quantization (M divides L, both are powers of 2)
encoding:
C(s,x) = M floor(x / q_s) + mod(x, q_s) + b_s
where b_s = q_0 + ... + q_{s-1}
decoding:
s = s(x mod M)
D(x) = (s, q_s floor(x / M) + mod(x, M) - b_s)
where s is such that b_s <= x mod M < b_{s+1}
and we need normalization to return to the range x \in I = {L, ..., bL-1}.

Now let us take tANS with b=2 for simplicity (it can be increased). Use quantization with denominator L:
p_s ~ q_s/L
Spread the tANS symbols in ranges: symbol s occupies positions {b_s, ..., b_{s+1}-1},
where b_s = L + q_0 + ... + q_{s-1} (larger by L this time).
x is in the range I = {L, ..., 2L-1}; symbol s encodes from the range {q_s, ..., 2q_s - 1}.

the decoding step is
s = s(x); // like for rANS, different approaches possible
x = q_s + x - b_s; // yep, that's all
while (x < L) x = (x << 1) + "bit from stream"; // renormalization
encoding s:
while (x >= 2*q_s) {produce(x & 1); x >>= 1;} // renormalization
x = b_s + x - q_s; // the proper coding step
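
For reference, the same steps as a compilable C sketch (read_bit(), produce_bit() and symbol_of() are hypothetical stream and table helpers):

#include <stdint.h>

extern const uint32_t L;           /* number of states, a power of 2 */
extern const uint32_t q[], b[];    /* q[s] ~ L*p[s]; b[s] = L + q[0]+...+q[s-1] */
extern int  read_bit(void);
extern void produce_bit(int bit);
extern int  symbol_of(uint32_t x); /* s such that b[s] <= x < b[s+1] */

uint32_t decode_step(uint32_t x, int *s_out) {
    int s = symbol_of(x);
    *s_out = s;
    x = q[s] + x - b[s];                                /* the whole coding step */
    while (x < L) x = (x << 1) | (uint32_t)read_bit();  /* renormalization */
    return x;
}

uint32_t encode_step(int s, uint32_t x) {
    while (x >= 2 * q[s]) { produce_bit(x & 1); x >>= 1; }  /* renormalization */
    return b[s] + x - q[s];                             /* the proper coding step */
}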

discussion: http://encode.ru/threads/2035-tANS-rANS-fusion-further-simplification-at-cost-of-accuracy

Jarek Duda

Nov 13, 2014, 6:53:41 AM
List of available implementations of ANS (more than 10 authors) - please let me know about updates:

http://encode.ru/threads/2078-List-of-Asymmetric-Numeral-Systems-implementations

Jarek Duda

Nov 26, 2014, 4:46:02 AM
While the biggest advantage of ANS is the large-alphabet case - as an equally accurate and ~1.5x faster replacement of Huffman coding (it has already replaced it in the zhuff and lzturbo compressors), or a ~7 times faster replacement of range coding (e.g. the LZA compressor) - it turns out that it can also bring a speedup in the binary case, replacing arithmetic coding: up to 2 times faster for the arithmetic variants, and this can be increased by a further 60-70% by interleaving (2 or more independent sources).

Here are decoding steps from an implementation by Pascal Massimino, for 32-bit x, 16-bit renormalization and 16-bit probability precision: p0 = prob*2^16 \in {1, ..., 2^16 - 1}.

rABS variant:

uint16_t xfrac = x & 0xffff;               // x mod 2^16
uint32_t x_new = p0 * (x >> 16);
if (xfrac < p0) { x = x_new + xfrac; out[i] = 0;   // x <- p0*(x>>16) + (x mod 2^16)
} else { x -= x_new + p0; out[i] = 1; }            // x <- (2^16-p0)*(x>>16) + (x mod 2^16) - p0

uABS variant (very similar to Matt Mahoney's fpaqc):

uint64_t xp = (uint64_t)x * p0 - 1;
out[i] = ((xp & 0xffff) + p0) >> 16; // <= frac(xp) < p0
xp = (xp >> 16) + 1;
x = out[i] ? xp : x - xp;

plus renormalization (much simpler than in arithmetic coding):

if (x < (1 << 16)) { x = (x << 16) | *ptr++; }

Sources, discussion, test results: http://encode.ru/threads/1821-Asymetric-Numeral-System/page8