How to quickly calculate the entropy of long vectors? -sum(x->x*log(2,x), [count(x->x==c,s)/length(s) for c in unique(s)])

paul analyst

Sep 5, 2014, 12:42:20 AM
julia> entropy(s)=-sum(x->x*log(2,x), [count(x->x==c,s)/length(s) for c in unique(s)]);

julia> s=rand(10^3);

julia> @time entropy(s)
elapsed time: 0.167097546 seconds (20255140 bytes allocated)
9.965784284662059

julia> s=rand(10^4);

julia> @time entropy(s)
elapsed time: 3.62008077 seconds (1602061320 bytes allocated, 21.81% gc time)
13.287712379549843

julia> s=rand(10^5);

julia> @time entropy(s)
elapsed time: 366.181311932 seconds (160021245832 bytes allocated, 21.89% gc time)
16.609640474434073

julia> s=rand(10^6);

julia> @time entropy(s)
................................
After 12 h it still hasn't finished :/

Paul

David P. Sanders

Sep 5, 2014, 1:34:34 AM


On Thursday, September 4, 2014 at 23:42:20 UTC-5, paul analyst wrote:
julia> entropy(s)=-sum(x->x*log(2,x), [count(x->x==c,s)/length(s) for c in unique(s)]);

julia> s=rand(10^3);

julia> @time entropy(s)
elapsed time: 0.167097546 seconds (20255140 bytes allocated)
9.965784284662059

julia> s=rand(10^4);

julia> @time entropy(s)
elapsed time: 3.62008077 seconds (1602061320 bytes allocated, 21.81% gc time)
13.287712379549843

julia> s=rand(10^5);

julia> @time entropy(s)
elapsed time: 366.181311932 seconds (160021245832 bytes allocated, 21.89% gc time)
16.609640474434073


You can see from the last two timings that the time is multiplied by 100 when the length of the vector is multiplied by 10, i.e. your method has O(n^2) complexity. This is due to the way you are counting repeats: for every unique value you scan the whole vector again. What you are basically building is a histogram.

If your data are really floats, then in any case some binning is required. If they are ints, you could use a dictionary. I think there's even a counting object already implemented (but I don't remember what it's called).
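
For the integer case, a minimal sketch of that dictionary-counting idea might look like the following (the ready-made counting object is likely countmap from the StatsBase package; entropy_counts is just an illustrative name):

function entropy_counts(s)
    counts = Dict{eltype(s),Int}()
    for x in s
        counts[x] = get(counts, x, 0) + 1   # single O(n) pass over the data
    end
    N = length(s)
    H = 0.0
    for n in values(counts)
        p = n / N
        H -= p * log(2, p)   # only observed values appear, so p > 0 always
    end
    return H
end

This keeps the whole computation O(n), at the cost of one dictionary entry per distinct value.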

How about this:

function entropy(s)
    N = length(s)
    num_bins = 10000
    h = hist(s, num_bins)
    p = h[2] ./ N  # probabilities
    -sum([x * log(2,x) for x in p])
end

julia> @time entropy(rand(10^6))
elapsed time: 0.199634039 seconds (79424624 bytes allocated, 31.51% gc time)
13.28044771568381

julia> @time entropy(rand(10^7))
elapsed time: 1.710673571 seconds (792084208 bytes allocated, 26.20% gc time)
13.286992511965552

julia> @time entropy(rand(10^8))
elapsed time: 18.20088571 seconds (7918627344 bytes allocated, 24.03% gc time)
13.28764216804997

The calculation is now O(n) instead.

Paul Analyst

Sep 5, 2014, 3:16:18 AM
Thanks, it works, but:
julia> t=h5read("FMLo2_z_reversem.h5","FMLo2",(:,1));

julia> eltype(t)
Float64

julia> t
5932868x1 Array{Float64,2}:
  0.0181719
  0.303473
 -0.526979
  ⋮
 -0.526979
  0.912295
 -0.0281875

julia> entropy(t)
NaN

julia> entropy(vec(t))
NaN
Why?
julia> s

1000000-element Array{Float64,1}:
 0.737228
 0.162308
 0.688503
 0.108727
 0.40552
 0.654883
 0.618194
 0.137147
 0.934373
 0.077236
 ⋮
 0.413675
 0.463914
 0.504321
 0.976408
 0.998662
 0.343927
 0.477739
 0.660533
 0.918326
 0.579264

julia> entropy(s)
13.280438296634793


julia>

julia> entropy(t[:,1])
NaN

???
Paul


On 2014-09-05 07:34, David P. Sanders wrote:

Gunnar Farnebäck

Sep 5, 2014, 4:46:10 AM
You get NaN because the binning has left some bins empty, and x * log(2, x) for x = 0.0 evaluates to 0.0 * -Inf, which is NaN. You can avoid this particular problem by replacing x * log(2,x) with something that special-cases x = 0.0 to return 0.0 rather than NaN.

Note that the result you get from this operation will depend on the number of bins and will typically be different from the result of your first approach.
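
For illustration, a minimal sketch of that special-casing applied to David's histogram version might look like this (assuming the 0.3-era hist function used above; the helper name xlog2x is just illustrative):

xlog2x(x) = x > 0.0 ? x * log(2, x) : 0.0   # treat 0 * log2(0) as 0 instead of NaN

function entropy(s, num_bins = 10000)
    N = length(s)
    h = hist(s, num_bins)
    p = h[2] ./ N
    -sum([xlog2x(x) for x in p])   # empty bins now contribute 0 rather than NaN
end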

Sebastian Nowozin

Sep 5, 2014, 4:50:27 AM

Hi Paul,

Independent of the speed issue: it seems you are trying to estimate the discrete entropy from observed frequencies, as opposed to computing the entropy of a given categorical distribution.

The naive plugin frequency estimator you use is a bad estimator, and you may want to consider using an entropy estimator with better statistical properties.
There is a great decision chart here: http://memming.wordpress.com/2014/02/09/a-guide-to-discrete-entropy-estimators/

In practice I have found the Grassberger 2003 estimator to be a good tradeoff between speed and statistical quality. A closed-form formula is given as equation (6) in http://www.nowozin.net/sebastian/papers/nowozin2012infogain.pdf
In terms of statistical accuracy, the NSB Bayes estimator is probably the best over a broad range of input distributions.
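
For reference, a rough sketch of the Grassberger estimator in the closed form I believe that paper uses (please check it against equation (6) before relying on it; on recent Julia versions digamma lives in the SpecialFunctions package):

using SpecialFunctions: digamma

# Correction term G(n) = psi(n) + (1/2) * (-1)^n * (psi((n+1)/2) - psi(n/2))
grassberger_G(n) = digamma(n) + 0.5 * (-1)^n * (digamma((n + 1) / 2) - digamma(n / 2))

# counts: vector of positive integer per-category counts n_i.
# The estimate is in nats; divide by log(2) to get bits and compare
# with the histogram-based value above.
function grassberger_entropy(counts)
    N = sum(counts)
    log(N) - sum([n * grassberger_G(n) for n in counts]) / N
end

For large counts this approaches the plug-in estimate, but it is designed to reduce the bias of the naive frequency estimator when many categories have only a few observations.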

Good luck,
Sebastian