The nonsymmetric Kullback-Leibler divergence

D{q||p} = INT(-inf,inf){ dx q(x)*ln(q(x)/p(x)) }

is used to measure the deviation of a model
probability distribution, p(x), from the actual
distribution q(x). This can be interpreted as
the expected value (with respect to q) of ln{q/p}.
For discrete distributions I assume that the
corresponding form is

D{Q||P} = SUM(i=1:c){ Qi * ln{ Qi / Pi } }.
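For concreteness, here is a minimal sketch of the discrete form in base
MATLAB; Q and P are made-up three-category distributions, and terms with
Qi = 0 are simply dropped (the 0*ln(0) = 0 convention discussed below):

  % D{Q||P} = SUM(i){ Qi * ln(Qi/Pi) }, dropping terms with Qi = 0
  Q  = [0.5 0.3 0.2];                        % "actual" distribution (made-up)
  P  = [0.4 0.4 0.2];                        % model distribution (made-up)
  nz = Q > 0;                                % indices of nonzero Qi
  D  = sum( Q(nz) .* log( Q(nz) ./ P(nz) ) ) % nonnegative; 0 iff P == Q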
If the categories are mutually exclusive,
Pj = 1 requires that Pk = 0 for all k ~= j.
However, Pj = 0 does not force any additional
constraints on the others.
If Pj = 0 when Qj = 0, it is obvious that
the interpretation should be
Qj * ln(Qj/Pj) = 0
However, what is the interpretation of D when the
model predicts Pj = 0 when Qj ~= 0?
Or, do you have to restrict the model (e.g.,
using logsig or softmax) so that
1. 0 < Pj < 1, (j = 1:c)
2. sum(i=1:c){ Pi } = 1
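For what it's worth, a softmax output enforces both constraints by
construction; a minimal base-MATLAB sketch, with z a made-up vector of
raw model outputs:

  z = [2.0 -1.0 0.5];          % raw (unbounded) model outputs, made-up
  P = exp(z - max(z));         % subtracting max(z) avoids overflow
  P = P / sum(P)               % now 0 < Pj < 1 and sum(P) = 1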
What changes when the modeled categories are
not mutually exclusive (e.g., short, fat and bald)
and Pj = 0 when Qj ~= 0?
TIA,
Greg
> What changes when the modeled categories are
> not mutually exclusive (e.g., short, fat and bald)?
In particular,
1. Is the form of D the same?
2. What happens if Pj = 0 when Qj ~= 0?
TIA,
Greg
> However, what is the interpretation of D when the
> model predicts Pj = 0 when Qj ~= 0?
The KL divergence is undefined in this case. Depending on your application, you might define the distance as infinite, because if you take a sequence of distributions converging to P, i.e. P^n --> P, where every component P^n_j is strictly positive, then the KL divergences D{Q||P^n} will converge to infinity.
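A quick numerical illustration of that behaviour (made-up numbers, with
Q2 ~= 0 held fixed while the model probability P2 is driven toward zero):

  Q = [0.5 0.3 0.2];                     % fixed "actual" distribution (made-up)
  for p2 = 10.^(-(2:2:8))                % P2 --> 0
      P = [0.5  p2  0.5-p2];             % model puts vanishing mass on class 2
      D = sum( Q .* log( Q ./ P ) );
      fprintf('P2 = %8.1e   D{Q||P} = %6.2f\n', p2, D)
  end

D grows like -Q2*ln(P2), i.e. without bound, as P2 --> 0.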
> If Pj = 0 when Qj = 0, it is obvious that
> the interpretation should be
>
> Qj * ln(Qj/Pj) = 0
Is it so obvious? Not if you need the KL divergence to behave continuously around Pj=0 or Qj=0.
For example, suppose Qj = t and Pj = exp(-1/t). Both converge to zero as t --> 0, but

Qj * ln(Qj/Pj) = t*ln(t*exp(1/t)) = t*ln(t) + 1,

which converges to 1 as t --> 0 (since t*ln(t) --> 0).
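Numerically (using the rewritten form t*ln(t) + 1, since exp(1/t) itself
overflows for small t):

  for t = 10.^(-(1:2:7))
      term = t*log(t) + 1;      % equals t*ln(t*exp(1/t)) without the overflow
      fprintf('t = %6.0e   Qj*ln(Qj/Pj) = %.6f\n', t, term)
  end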
Isn't "converging to infinity" an oxymoron?
Anyway, "infinite" seems more descriptive than "not defined",
because it can only happen when Qj ~= 0 and Pj = 0.
Hope this helps.
Greg
This example does not fit into the context of a FIXED
true discrete distribution Q = [Q1 Q2 Q3] = [ 2/3 0 1/3]
and a continuum of estimated discrete distributions
P = [P1(t) P2(t) P3(t)] with P2(t) --> 0.
Hope this is clear,
Greg
> Isn't "converging to infinity" an oxymoron?
Well, yes. It seemed a bit easier on the tongue, though, than saying "the KL divergence diverges..."
> Anyway, "infinite" seems more descriptive than "not defined",
> because it can only happen when Qj ~= 0 and Pj = 0.
As I say, you can define it that way if it suits you. It's not always useful to do so...
> This example does not fit into the context of a FIXED
> true discrete distribution Q = [Q1 Q2 Q3] = [ 2/3 0 1/3]
> and a continuum of estimated discrete distributions
> P = [P1(t) P2(t) P3(t)] with P2(t) --> 0.
*IF* Q is fixed. But if you start to tweak Q, realize that you can have jump changes in your results. Similarly, if you have Qj that are close to, but not exactly, zero, realize that your calculations could be very different from the case where Qj = 0 precisely.
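A small made-up example of that jump, with the model assigning exactly
zero probability to category 2: D is finite when Q2 = 0 exactly, but
infinite as soon as Q2 is merely close to zero:

  P  = [0.5  0     0.5];            % model gives exactly zero mass to class 2
  Q0 = [0.7  0     0.3];            % Q2 exactly zero
  Q1 = [0.7  1e-6  0.3-1e-6];       % Q2 close to, but not exactly, zero
  kl = @(Q,P) sum( Q(Q>0) .* log( Q(Q>0) ./ P(Q>0) ) );
  [ kl(Q0,P)  kl(Q1,P) ]            % finite for Q0, Inf for Q1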
My application is to classifiers where the Qj are coded
as 1-of-c binary. Therefore, they are either 0 or 1.
I never considered any other usage. So in your context
I agree that you have to know exactly how Q and P approach
the limit before you can make any sensible statement
about the Qj*ln(Qj/Pj) combination.
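In that 1-of-c setting the sum collapses to a single term, so D{Q||P}
is just -ln(P at the true class); a minimal sketch with made-up numbers:

  Q = [0 1 0];                     % 1-of-c target: true class is 2
  P = [0.2 0.7 0.1];               % model probabilities (made-up, sum to 1)
  D = sum( Q(Q>0) .* log( Q(Q>0) ./ P(Q>0) ) )   % D{Q||P}
  D_alt = -log( P(Q==1) )          % same value: -ln(P at the true class)
  % both blow up only if the model gives zero probability to the true class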
Hope this helps.
Greg