Another benchmark of entropy coders (FSE is tANS):
http://encode.ru/threads/1920-In-memory-open-source-benchmark-of-entropy-coders
Constructing a tANS entropy coding automaton with L states consists of two steps:
- quantization of the symbol probability distribution as p[s] ~ q[s]/L fractions,
- spreading these symbols in the range [0, L-1], such that symbol s appears q[s] times.
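As a reference point for step (2), here is a minimal sketch (hypothetical code, not taken from the toolkit) of a simple fast spread of the kind FSE uses: walk the positions with a fixed stride and drop successive appearances of each symbol at the visited slots:

    #include <string>
    #include <vector>

    // q[s] appearances of each symbol s are placed along a fixed-stride walk;
    // for power-of-2 L >= 16 the stride is odd, so every position is visited exactly once
    std::string spread_fast_sketch(const std::vector<int>& q, int L) {
        std::string spread(L, '?');
        int step = (L >> 1) + (L >> 3) + 3;   // FSE-style stride, coprime with L
        int pos = 0;
        for (int s = 0; s < (int)q.size(); ++s)
            for (int i = 0; i < q[s]; ++i) {  // sum of q[s] is L, so all slots get filled
                spread[pos] = char('0' + s);
                pos = (pos + step) % L;
            }
        return spread;
    }

Such a spread looks only at q, which is exactly the limitation the tuned spread below addresses.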
I have created a C++ toolkit to test different choices here and compare them with Huffman, getting a few improvements along the way:
https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit
The most important improvement is a finally fast "tuned" symbol spread.
Current implementations usually spread symbols using only q, with some basic tuning slowly appearing, like shifting low-probability single-appearance symbols to the end of the range (recent FSE).
The new spread_tuned() uses all symbol probabilities in addition to q. For example, for 4 symbols and L=16 states:
p:   0.04   0.16   0.16   0.64
q:   1      3      2      10
q/L: 0.0625 0.1875 0.125  0.625
spread_prec() gives 3313233103332133 with dH/H ~ 0.015,
while spread_tuned() gives 3233321333313310 with dH/H ~ 0.0046:
it shifts right the symbols for which q[s]/L > p[s] and left otherwise, getting a better agreement with the actual probabilities, and so a better compression rate.
The idea is pretty simple: we know that the probability distribution of states should be approximately P(x) \propto 1/x, so we choose the expected symbol appearances according to this rule. It is especially important if we switch between different coding tables, but it also gives improvements while using a single table.
P(x) \propto 1/x leads to a ~1/ln(2) normalization factor (harmonic numbers): P(x) ~ 1/(x*ln(2)) for x \in [L, 2L). Encoding some appearance of a symbol is made from a range of state values: all x such that after k steps of x -> floor(x/2) we get i \in {q[s], ..., 2q[s]-1}. Under our assumption, the probability of such a range turns out to be ~ ln(1+1/i)/ln(2). Multiplied by p[s], it is the probability of the obtained state x, which should be ~ 1/(x*ln(2)), finally giving:
the preferred position for the i-th appearance of symbol s (i \in {q[s], ..., 2q[s]-1}) is x = 1/(p[s] * ln(1+1/i))
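For instance, for symbol 3 of the example above (p = 0.64, q = 10): appearance i=10 gets preferred position x = 1/(0.64*ln(11/10)) ~ 16.4, while i=19 gets x = 1/(0.64*ln(20/19)) ~ 30.5, so its ten appearances get spread over nearly the whole state range x \in [16, 32).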
The spread_tuned() creates lists of preferred symbols for each position, then spreads symbols by reading these lists in order.
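Here is a minimal sketch of this procedure - my reading of the description, not the toolkit's exact spread_tuned(); sorting by preferred position stands in for the per-position lists (which are what make it linear):

    #include <algorithm>
    #include <cmath>
    #include <string>
    #include <utility>
    #include <vector>

    // preferred state for the i-th appearance of s: x = 1/(p[s]*ln(1+1/i));
    // appearances are then laid out in order of preferred position
    std::string spread_tuned_sketch(const std::vector<double>& p,
                                    const std::vector<int>& q, int L) {
        std::vector<std::pair<double, int>> pref;   // (preferred x, symbol)
        for (int s = 0; s < (int)p.size(); ++s)
            for (int i = q[s]; i <= 2 * q[s] - 1; ++i)
                pref.push_back({1.0 / (p[s] * std::log(1.0 + 1.0 / i)), s});
        std::stable_sort(pref.begin(), pref.end()); // sum of q[s] = L entries
        std::string spread(L, '?');
        for (int pos = 0; pos < L; ++pos)
            spread[pos] = char('0' + pref[pos].second);
        return spread;
    }

For the example above (p = {0.04, 0.16, 0.16, 0.64}, q = {1, 3, 2, 10}, L = 16) this yields a spread of the same tuned character - in particular the over-represented symbol 0 is pushed to the very end - though rounding and tie-breaking details may differ slightly from the toolkit's output.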
It has linear complexity, but might be too slow for some applications. For a speedup, only low-probability symbols can be treated this way (e.g. q[s]=1, maybe 2), with a fast spread used for the remaining ones.
Another new function is quantize_prec(), which finds the best quantization according to the sum_s (p[s] - q[s]/L)^2 / p[s] approximation of dH. Unfortunately it is O(n log n).
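One way to get such a quantizer (a hypothetical sketch in the spirit of quantize_prec(), not necessarily its exact algorithm): start from rounded counts and repair sum(q) = L by repeatedly taking the +/-1 step that increases the penalty the least, via a priority queue - hence the n log n:

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <queue>
    #include <vector>

    std::vector<int> quantize_prec_sketch(const std::vector<double>& p, int L) {
        int n = p.size(), used = 0;
        std::vector<int> q(n);
        for (int s = 0; s < n; ++s) { q[s] = std::max(1, (int)std::lround(p[s] * L)); used += q[s]; }
        auto cost = [&](int s, int qs) { double d = p[s] - (double)qs / L; return d * d / p[s]; };
        int dir = (used < L) ? +1 : -1;             // need to add or remove counts
        using Item = std::pair<double, int>;        // (penalty increase of one step, symbol)
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
        for (int s = 0; s < n; ++s)
            if (dir == +1 || q[s] > 1)              // keep every q[s] >= 1
                heap.push({cost(s, q[s] + dir) - cost(s, q[s]), s});
        while (used != L) {                         // apply the cheapest step, re-queue
            int s = heap.top().second; heap.pop();
            q[s] += dir; used += dir;
            if (dir == +1 || q[s] > 1)
                heap.push({cost(s, q[s] + dir) - cost(s, q[s]), s});
        }
        return q;
    }

Since the per-symbol penalty is convex in q[s], such greedy +/-1 steps reach the exact minimum.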
There is also quantize_fast(), which after q[s] = round(p[s]*L) just uses the most probable symbol to repair the balance (sum_s q[s] = L) - it does not always work(!).
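A sketch of that fast approach as I read it (illustrative code, with the failure mode visible):

    #include <cmath>
    #include <vector>

    // round everything, then dump the whole rounding error onto the most
    // probable symbol; can fail: a rare symbol may get q[s] = 0, or the
    // correction may push the top symbol's own count below 1
    std::vector<int> quantize_fast_sketch(const std::vector<double>& p, int L) {
        int n = p.size(), used = 0, best = 0;
        std::vector<int> q(n);
        for (int s = 0; s < n; ++s) {
            q[s] = (int)std::lround(p[s] * L);
            used += q[s];
            if (p[s] > p[best]) best = s;
        }
        q[best] += L - used;   // repair the balance sum_s q[s] = L
        return q;
    }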
I know the toolkit crashes for nasty parameters - it is mostly caused by the iterative process of finding the stationary probability distribution of states (used to get hANS), which can fail to converge for such parameters. It could be replaced by a direct method of finding the dominant eigenvector, but that would require storing the whole transition matrix - the current approach uses linear memory (important for eventually large L) and usually converges fast.
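For concreteness, a sketch of such a linear-memory iteration (my illustration, not the toolkit's code; encode_state(s, x), the next state after encoding s from state x, is assumed to come from the already built coding tables):

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    // power-method style iteration: repeatedly push the current state
    // distribution through one encoding step; memory stays O(L)
    std::vector<double> stationary_sketch(const std::vector<double>& p, int L,
            std::function<int(int, int)> encode_state) {
        std::vector<double> P(L, 1.0 / L), Pn(L);
        for (int iter = 0; iter < 10000; ++iter) {   // may fail to converge
            std::fill(Pn.begin(), Pn.end(), 0.0);
            for (int x = L; x < 2 * L; ++x)
                for (int s = 0; s < (int)p.size(); ++s)
                    Pn[encode_state(s, x) - L] += P[x - L] * p[s];
            double diff = 0;
            for (int x = 0; x < L; ++x) diff += std::fabs(Pn[x] - P[x]);
            P.swap(Pn);
            if (diff < 1e-12) break;                 // P[x-L] ~ Pr(state = x)
        }
        return P;
    }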
Feel free to add more probability distributions, better quantizers and spreads.