The nonsymmetric Kullback-Leibler divergence

D{q||p} = INT(-inf,inf){ dx q(x)*ln(q(x)/p(x)) }

is used to measure the deviation of a model
probability distribution, p(x), from the actual
distribution q(x). This can be interpreted as
the expected value (with respect to q) of ln{q/p}.
For discrete distributions I assume that the
corresponding form is

D{Q||P} = SUM(i=1:c){ Qi * ln{ Qi / Pi } }.
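For concreteness, here is a minimal sketch of the discrete form in base
MATLAB; Q and P are made-up three-category distributions, and terms with
Qi = 0 are simply dropped (the 0*ln(0) = 0 convention discussed below):

  % D{Q||P} = SUM(i){ Qi * ln(Qi/Pi) }, dropping terms with Qi = 0
  Q  = [0.5 0.3 0.2];                        % "actual" distribution (made-up)
  P  = [0.4 0.4 0.2];                        % model distribution (made-up)
  nz = Q > 0;                                % indices of nonzero Qi
  D  = sum( Q(nz) .* log( Q(nz) ./ P(nz) ) ) % nonnegative; 0 iff P == Q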
If the categories are mutually exclusive,
Pj = 1 requires that Pk = 0 for all k ~= j.
However, Pj = 0 does not force any additional
constraints on the others.
If Pj = 0 when Qj = 0, it is obvious that
the interpretation should be
Qj * ln(Qj/Pj) = 0
However, what is the interpretation of D when the
model predicts Pj = 0 when Qj ~= 0?
Or, do you have to restrict the model (e.g.,
using logsig or softmax) so that
1. 0 < Pj < 1, (j = 1:c)
2. sum(i=1:c){ Pi } = 1
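For what it's worth, a softmax output enforces both constraints by
construction; a minimal base-MATLAB sketch, with z a made-up vector of
raw model outputs:

  z = [2.0 -1.0 0.5];          % raw (unbounded) model outputs, made-up
  P = exp(z - max(z));         % subtracting max(z) avoids overflow
  P = P / sum(P)               % now 0 < Pj < 1 and sum(P) = 1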
What changes when the modeled categories are
not mutually exclusive (e.g., short, fat and bald)
and Pj = 0 when Qj ~= 0?
TIA,
Greg
> What changes when the modeled categories are
> not mutually exclusive (e.g., short, fat and bald)?
In particular,
1. Is the form of D the same?
2. What happens if Pj = 0 when Qj ~= 0?
TIA,
Greg
> However, what is the interpretation of D when the
> model predicts Pj = 0 when Qj ~= 0?
The KL divergence is undefined in this case. Depending on your application, you might define the distance as infinite, because if you take a sequence of distributions converging to P, i.e. P^n --> P, where every component P^n_j is strictly positive, then the KL divergences D{Q||P^n} will converge to infinity.
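A quick numerical illustration of that behaviour (made-up numbers, with
Q2 ~= 0 held fixed while the model probability P2 is driven toward zero):

  Q = [0.5 0.3 0.2];                     % fixed "actual" distribution (made-up)
  for p2 = 10.^(-(2:2:8))                % P2 --> 0
      P = [0.5  p2  0.5-p2];             % model puts vanishing mass on class 2
      D = sum( Q .* log( Q ./ P ) );
      fprintf('P2 = %8.1e   D{Q||P} = %6.2f\n', p2, D)
  end

D grows like -Q2*ln(P2), i.e. without bound, as P2 --> 0.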
> If Pj = 0 when Qj = 0, it is obvious that
> the interpretation should be
>
> Qj * ln(Qj/Pj) = 0
Is it so obvious? Not if you need the KL divergence to behave continuously around Pj=0 or Qj=0.
For example, suppose Qj = t and Pj = exp(-1/t). Both converge to zero as t --> 0, but

Qj * ln(Qj/Pj) = t*ln(t*exp(1/t)) = t*ln(t) + 1,

which converges to 1 as t --> 0 (since t*ln(t) --> 0).
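Numerically (using the rewritten form t*ln(t) + 1, since exp(1/t) itself
overflows for small t):

  for t = 10.^(-(1:2:7))
      term = t*log(t) + 1;      % equals t*ln(t*exp(1/t)) without the overflow
      fprintf('t = %6.0e   Qj*ln(Qj/Pj) = %.6f\n', t, term)
  end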
Isn't "converging to infinity" an oxymoron?
Anyway, "infinite" seems more descriptive than "not defined",
because it can only happen when Qj ~= 0 and Pj = 0.
Hope this helps.
Greg
This example does not fit into the context of a FIXED
true discrete distribution Q = [Q1 Q2 Q3] = [ 2/3 0 1/3]
and a continuum of estimated discrete distributions
P = [P1(t) P2(t) P3(t)] with P2(t) --> 0.
Hope this is clear,
Greg
> Isn't "converging to infinity" an oxymoron?
Well, yes. It seemed a bit easier on the tongue, though, than saying "the KL divergence diverges..."
> Anyway, "infinite" seems more descriptive than "not defined",
> because it can only happen when Qj ~= 0 and Pj = 0.
As I say, you can define it that way if it suits you. It's not always useful to do so...
> This example does not fit into the context of a FIXED
> true discrete distribution Q = [Q1 Q2 Q3] = [ 2/3 0 1/3]
> and a continuum of estimated discrete distributions
> P = [P1(t) P2(t) P3(t)] with P2(t) --> 0.
*IF* Q is fixed. But if you start to tweak Q, realize that you can have jump changes in your results. Similarly, if you have Qj that are close to, but not exactly, zero, realize that your calculations could be very different from the case where Qj = 0 precisely.
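A small made-up example of that jump, with the model assigning exactly
zero probability to category 2: D is finite when Q2 = 0 exactly, but
infinite as soon as Q2 is merely close to zero:

  P  = [0.5  0     0.5];            % model gives exactly zero mass to class 2
  Q0 = [0.7  0     0.3];            % Q2 exactly zero
  Q1 = [0.7  1e-6  0.3-1e-6];       % Q2 close to, but not exactly, zero
  kl = @(Q,P) sum( Q(Q>0) .* log( Q(Q>0) ./ P(Q>0) ) );
  [ kl(Q0,P)  kl(Q1,P) ]            % finite for Q0, Inf for Q1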
My application is to classifiers where the Qj are coded
as 1-of-c binary. Therefore, they are either 0 or 1.
I never considered any other usage. So in your context
I agree that you have to know exactly how Q and P approach
the limit before you can make any sensible statement
about the Qj*ln(Qj/Pj) combination.
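In that 1-of-c setting the sum collapses to a single term, so D{Q||P}
is just -ln(P at the true class); a minimal sketch with made-up numbers:

  Q = [0 1 0];                     % 1-of-c target: true class is 2
  P = [0.2 0.7 0.1];               % model probabilities (made-up, sum to 1)
  D = sum( Q(Q>0) .* log( Q(Q>0) ./ P(Q>0) ) )   % D{Q||P}
  D_alt = -log( P(Q==1) )          % same value: -ln(P at the true class)
  % both blow up only if the model gives zero probability to the true class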
Hope this helps.
Greg