Probabilities always >= 0 and <= 1?

FFMG

May 8, 2012, 5:35:35 AM

Hi,

I was looking at a site (http://bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html) that explains how to build a Naive Bayes classifier.

But in some cases the formula gives me probabilities greater than 1.
How is that possible?

// Total of 18 documents.
// * 9 documents out of a total of 18 are spam messages
// * 3 documents out of those 18 contain the word "naughty"
// * 3 documents containing the word "naughty" have been marked as spam
// * 3 documents out of the total contain the word "money"
// * 3 emails out of those have been marked as spam

P(spam|naughty,money) = [P(money|spam) * P(money|spam) * P(spam)] / [P(naughty) * P(money)]

P(spam|naughty,money) = (3/9 * 3/9 * 9/18) / (3/18 * 3/18) = 2

But how can a probability be outside of 0 and 1? Must I always force the numbers to be between 0 and 1 and accept that in some cases they will fall outside the range?
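
The arithmetic can be checked with exact fractions. This is only a sketch: the variable names are made up for illustration, and the first factor is read as P(naughty|spam), which has the same value as P(money|spam) in this example.

```python
from fractions import Fraction

# Relative frequencies taken from the 18-document example above.
p_spam = Fraction(9, 18)
p_naughty_given_spam = Fraction(3, 9)
p_money_given_spam = Fraction(3, 9)
p_naughty = Fraction(3, 18)
p_money = Fraction(3, 18)

numerator = p_naughty_given_spam * p_money_given_spam * p_spam
# This denominator treats "naughty" and "money" as independent.
denominator = p_naughty * p_money

estimate = numerator / denominator
print(numerator, denominator, estimate)  # 1/18 1/36 2
```

So the value 2 is not a slip in the arithmetic; it follows directly from the formula as written.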

Many thanks for suggestions as to where I might have gone wrong.

Regards,

FFMG

Ross

May 8, 2012, 7:31:31 AM

You have made an incorrect independence assumption. As both "naughty"
and "money" are only present in "spam" documents, which form half of
the total number of documents, they are dependent variables. But you
calculate p(e) as p(money) * p(naughty), which assumes that the
variables are independent. Hence your problem.

FFMG

May 8, 2012, 10:08:29 AM

>
> You have made an incorrect independence assumption. As both "naughty"
> and "money" are only present in "spam" documents, which form half of
> the total number of documents, they are dependent variables. But you
> calculate p(e) as p(money) * p(naughty), which assumes that the
> variables are independent. Hence your problem.

Sorry, but that's not an assumption; that's how the problem is defined. The words "naughty" and "money" are indeed only present in "spam".

And they are independent variables: the presence of "naughty" does not depend on "money" (and vice versa).

The formula is P(C|F1...Fn) = [P(C) * P(F1|C) * ... * P(Fn|C)] / [P(F1) * ... * P(Fn)]

So, given the problem in my original post, the result is not between 0 and 1.

FFMG

Jussi Piitulainen

May 8, 2012, 10:41:18 AM

Probability theory only gives you
P(C | F1...Fn) = P(C) P(F1...Fn | C) / P(F1...Fn).

Then come the independence assumptions which allow you to expand
P(F1...Fn | C) as P(F1 | C)...P(Fn | C) and P(F1...Fn) similarly.
These give Naive Bayes its first name.

If "naughty" and "money" were exactly independent and probabilities
exactly relative frequencies in your document collection, there should
be half a document that contains them both. Half a document does not
quite make sense, but there's worse: if "naughty" and "money" were
exactly independent given "spam", there should be _one_ document that
contains both "naughty" and "money" (and is classified as "spam").

Since we don't want to accept 1/2 = 1 and we think that relative
frequencies do have the formal properties of probabilities, we blame
the independence assumptions. I suppose they would be
closer to the truth much of the time in a larger population.
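
The two document counts mentioned above can be reproduced as follows. This is a rough sketch using the relative frequencies from the original post; the variable names are illustrative only.

```python
from fractions import Fraction

total_docs = 18
spam_docs = 9

p_naughty = Fraction(3, total_docs)
p_money = Fraction(3, total_docs)
p_naughty_given_spam = Fraction(3, spam_docs)
p_money_given_spam = Fraction(3, spam_docs)

# Documents with both words if the words were independent overall.
expected_both_overall = total_docs * p_naughty * p_money

# Documents with both words if the words were independent given "spam"
# (every such document is spam, since neither word occurs outside spam).
expected_both_in_spam = spam_docs * p_naughty_given_spam * p_money_given_spam

print(expected_both_overall, expected_both_in_spam)  # 1/2 1
```

The two assumptions predict different counts for the same set of documents, so they cannot both hold exactly on this data.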

FFMG

May 8, 2012, 11:00:50 AM

So, if I understand you correctly, the two issues at hand are:
1) I don't have enough documents and classified words (or at least, the more I have, the more likely the result is to fall between 0 and 1).
2) The Naive Bayes formula does not guarantee a number between 0 and 1.

So, as the formula seems to be correct in my example, I guess my question would be: is there any way of bringing the number back between 0 and 1? Or can I simply assume that anything > 1 is in fact 1 (or almost 1)?

Following on from that, I also see many examples where the denominator is ignored because it can be regarded as constant. But then how can I calculate how close the probability is to 1? (Without a denominator I have no idea how close a document is to being 'spam'.)

Thanks again

FFMG

Gus Gassmann

May 8, 2012, 11:35:51 AM

Neither. There is no requirement on the number of documents, and the
Bayes formula works. However, your original stab was certainly false,
since it contained "money" twice and did not contain "naughty" at all.
As such it is difficult even to figure out what you were trying to
compute.

> So, as the formula seems to be correct in my example, I guess my question would be, is there any way of bringing the number back between 0 and 1? or can I simply assume that anything > 1 is in fact 1, (or almost 1).

As Jussi explained, you have to use the data correctly, and you got to
think a little. What is Prob(spam|naughty)? What is Prob(spam|money)?
What then do you think of Prob(spam|naughty AND money)? Hint, in this
case you do not even need Bayes.

> Following on to that, I also see many examples where the denominator can be ignored as it can be regarded as constant. But then how can I calculate how close the probability of a number is to 1? (because without a denominator I have no idea how close the probability of a document is to be 'spam').

You'll have to explain better what you mean by this. As is, it makes no
sense to me.

FFMG

May 8, 2012, 12:07:33 PM

>
> Neither. There is no requirement on the number of documents, and the
> Bayes formula works. However, your original stab was certainly false,
> since it contained "money" twice and did not contain "naughty" at all.
> As such it is difficult even to figure out what you were trying to
> compute.

This is my original statement, and it does contain "naughty", (in 3 of the 18 docs, all are 'spam' docs) and "money" is also in 3 of the 18 docs and they are also all 'spam'.

// Total of 18 documents.
// * 9 documents out of a total of 18 are spam messages
// * 3 documents out of those 18 contain the word "naughty"
// * 3 documents containing the word "naughty" have been marked as spam
// * 3 documents out of the total contain the word "money"
// * 3 emails out of those have been marked as spam

>

> > So, as the formula seems to be correct in my example, I guess my question would be, is there any way of bringing the number back between 0 and 1? or can I simply assume that anything > 1 is in fact 1, (or almost 1).
>
> As Jussi explained, you have to use the data correctly, and you got to
> think a little. What is Prob(spam|naughty)? What is Prob(spam|money)?
> What then do you think of Prob(spam|naughty AND money)? Hint, in this
> case you do not even need Bayes.
>
> > Following on to that, I also see many examples where the denominator can be ignored as it can be regarded as constant. But then how can I calculate how close the probability of a number is to 1? (because without a denominator I have no idea how close the probability of a document is to be 'spam').
>
> You'll have to explain better what you mean by this. As is, it makes no
> sense to me.

Sorry about that. There are a few examples, in the original link I posted and also here: http://en.wikipedia.org/wiki/Naive_Bayes_classifier#The_naive_Bayes_probabilistic_model, which says: "In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features F_i are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model"

Regards,

FFMG

Jussi Piitulainen

May 8, 2012, 12:14:24 PM

Gus Gassmann writes:

The double "money" was an obvious typo. It should have been "naughty"
and "money".

The independence assumptions are an essential part of the Naive Bayes
method, the naive part. They are known to be strictly false but they
simplify things and the method still seems to work in practice.

I don't have any personal experience with such methods, however. My
guess about more data making the assumptions "less false" in practice
is just a guess.

> > So, as the formula seems to be correct in my example, I guess my
> > question would be, is there any way of bringing the number back
> > between 0 and 1? or can I simply assume that anything > 1 is in
> > fact 1, (or almost 1).
>
> As Jussi explained, you have to use the data correctly, and you got
> to think a little. What is Prob(spam|naughty)? What is
> Prob(spam|money)? What then do you think of Prob(spam|naughty AND
> money)? Hint, in this case you do not even need Bayes.

That wouldn't be Naive. I don't see what to do without Bayes either,
is that just me? And what if there are no instances of "naughty" AND
"money" when assigning the probabilities?

> > Following on to that, I also see many examples where the
> > denominator can be ignored as it can be regarded as constant. But
> > then how can I calculate how close the probability of a number is
> > to 1? (because without a denominator I have no idea how close the
> > probability of a document is to be 'spam').
>
> You'll have to explain better what you mean by this. As is, it makes
> no sense to me.

The denominator does not matter when one is comparing alternatives
that have the same denominator. When one says that the posterior is
proportional to the prior and the likelihood, one thinks of that
denominator as an uninteresting proportionality constant.
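
As a rough illustration of the proportionality argument, the numerators can be computed for each class and then divided by their sum. The dictionaries below are filled from the counts in the original post (neither word occurs in a non-spam document); the names are made up for this sketch.

```python
from fractions import Fraction

# Counts from the example in this thread: neither word occurs in "not spam".
prior = {"spam": Fraction(9, 18), "not spam": Fraction(9, 18)}
p_naughty = {"spam": Fraction(3, 9), "not spam": Fraction(0, 9)}
p_money = {"spam": Fraction(3, 9), "not spam": Fraction(0, 9)}

# Numerator only: proportional to the posterior, not a probability itself.
score = {c: prior[c] * p_naughty[c] * p_money[c] for c in prior}

# Dividing by the sum of the scores turns them into probabilities.
total = sum(score.values())
posterior = {c: score[c] / total for c in score}

print(score["spam"], score["not spam"])          # 1/18 0
print(posterior["spam"], posterior["not spam"])  # 1 0
```

Comparing the raw scores is enough to pick the more likely class; the division is only needed when an actual probability is wanted.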

Perhaps it's so that an actual probability P("spam" | data) alone
would be more or less meaningful, but with values that are only
proportional to probabilities one would need such values for both
"spam" and "not spam" to convey the same information.

Gus Gassmann

May 8, 2012, 1:13:47 PM

On May 8, 1:14 pm, Jussi Piitulainen <jpiit...@ling.helsinki.fi>
wrote:

It is not you. My reading comprehension is somewhat lacking today. "Is
spam" and "has been marked as spam" do not have to be the same. Duh!

Ross

May 8, 2012, 2:08:48 PM

They are only independent if p(naughty & money) = p(naughty) *
p(money)

Or we could put it as:

p(naughty|money) = p(naughty)

But this is incredibly unlikely in your problem, as both "naughty" and
"money" appear only in the documents which are spam. Hence it is most
likely that:

p(naughty|money) > p(naughty)

and also:

p(naughty & money) > p(naughty) * p(money)

This inequality is the root cause of your problem.

I think I've given you the correct explanation of your error, and
you've just ignored it. I don't know what to say in addition to that.

Ross

May 8, 2012, 2:17:21 PM

On May 8, 5:07 pm, FFMG <spambuc...@myoddweb.com> wrote:
> Sorry about that, there are a few examples, (in the original link I posted), or here as well http://en.wikipedia.org/wiki/Naive_Bayes_classifier#The_naive_Bayes_p..., "In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features F_i are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model"
>
> Regards,
>
> FFMG

But your question is why you get probabilities that are outside the
interval [0,1].

In that case, you have to look at the normalising factor p(e), where
you have an inappropriate independence assumption. In a previous post,
you've stated that the words are independent, but they aren't. And
that's the root cause of you getting numbers outside the bounds of
classical probability.

The independence assumptions of the naive Bayes classifier give you a
very simple computation, which for easy domains such as text
classification often gives a reasonably accurate prediction of the
most likely class. But that simplicity comes at a price,
which is a loss of accuracy.

FFMG

May 8, 2012, 2:18:03 PM

>
> But this is incredibly unlikely in your problem, as both "naughty" and
> "money" appear only in the documents which are spam. Hence it is most
> likely that:

Well, I am sorry, but this is the way the problem is defined: "naughty" and "money" appear only in the documents which are spam.

It might be unlikely in a real life scenario, but that should not make the formula 'wrong'.

With the problem I gave, the Naive Bayes formula returns a probability of 2.

Thanks

FFMG

Ross

May 8, 2012, 2:34:05 PM

Actually, I've thought of a clearer way of explaining the OP's
problem.

Bayes' theorem says that:

p(h|e) = (p(e|h) * p(h)) / p(e)

In the OP's example, h is "spam" and e is "naughty & money", for
which I'll use the common way of writing "naughty,money". So, that
becomes:

p(spam|naughty,money) = ( p(naughty,money|spam) * p(spam) ) /
p(naughty,money)

That's the correct formalisation of the problem. However, in writing
the problem in a naive Bayes formalism, the OP has produced:

p(spam|naughty,money) = ( p(naughty|spam) * p(money|spam) *
p(spam) ) / ( p(naughty) * p(money) )

This has two approximations, the first of which is

p(naughty,money|spam) approx= p(naughty|spam) * p(money|spam)

The second, and intuitively the one more likely to be inaccurate, is:

p(naughty,money) approx= p(naughty) * p(money)

Since both the word "naughty" and the word "money" are positively
correlated with the underlying concept "spam", the correct probability
p(naughty,money) is likely to be much higher than p(naughty) *
p(money), and the denominator being smaller than it should be is
likely to be what causes the result of calculation to be 2.

What the OP needs to do is to flesh out his/her example by listing all
18 documents, telling us exactly which are spam and which aren't, and
which include the word "naughty" or the word "money".

E.g. a table like this:

doc|classification|num words|count('naughty')|count('money')
1|spam|20|3|2
2|not spam|33|0|0
......

With a full table like that, it would be possible to explain the OP's
problem.
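
To make this concrete, here is a sketch with an entirely HYPOTHETICAL set of 18 documents that matches the counts in the original post. The overlap between "naughty" and "money" (one shared document) is an assumption made for illustration; the thread never states it. The helper `prob` is likewise made up for this example.

```python
from fractions import Fraction

# HYPOTHETICAL corpus: documents 1-9 are spam, 10-18 are not.
# "naughty" is in documents 1-3 and "money" in documents 3-5, so exactly
# one document (3) contains both. The thread does not give this overlap.
docs = [
    {"spam": i <= 9, "naughty": 1 <= i <= 3, "money": 3 <= i <= 5}
    for i in range(1, 19)
]

def prob(event, given=lambda d: True):
    """Relative frequency of `event` among documents satisfying `given`."""
    pool = [d for d in docs if given(d)]
    return Fraction(sum(1 for d in pool if event(d)), len(pool))

both = lambda d: d["naughty"] and d["money"]
is_spam = lambda d: d["spam"]

# Exact posterior, read straight off the table.
exact = prob(is_spam, given=both)

# Naive estimate with the product denominator p(naughty) * p(money).
naive = (prob(lambda d: d["naughty"], is_spam)
         * prob(lambda d: d["money"], is_spam)
         * prob(is_spam)
         / (prob(lambda d: d["naughty"]) * prob(lambda d: d["money"])))

print(prob(both), prob(lambda d: d["naughty"]) * prob(lambda d: d["money"]))  # 1/18 1/36
print(exact, naive)  # 1 2
```

Under this assumed overlap, p(naughty,money) is twice p(naughty) * p(money), which accounts for the factor of 2 in the estimate.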

Ross

May 8, 2012, 2:38:35 PM

The naive Bayes formula ESTIMATES a probability of 2. The fact that
it's 2 shows that the estimate is inaccurate. I'm trying to explain to
you why it is inaccurate in your problem.

There is nothing wrong with "naughty" and "money" appearing only in
the documents that are spam. But, what happens then is that they are
most likely dependent variables, since if we spot one of them in a
document that raises the probability that we see the other one. I.e.

p(naughty&money) > p(naughty) * p(money)

which shows that they are dependent variables, not independent ones.

Since the variables are dependent while the naive Bayes classifier
assumes independence, the estimate of the probability is inaccurate.

Ray Vickson

May 8, 2012, 2:59:34 PM

Using your independence assumption I get P{spam|money,naughty} = 1. Here are my calculations (using notation M = money, S = spam and N = naughty). We are given P(S) = 9/18 = 1/2, P(N|S) = 3/9 = 1/3 and P(M|S) = 1/3. You assumed *conditional independence*, which is that P(N,M|S) = P(N|S)*P(M|S) = 1/9. There is NO reason why this should be true; some models will have it true, others not. However, let's assume it, so we can see what are the consequences.

So, assuming conditional independence, we have:
P(S|N,M) = P(N,M|S)*P(S)/P(N,M).

Now P(N,M) = P(N,M|S)*P(S) + P(N,M|not_S)*P(not_S). We have no observations of N or M in the presence of not_S, so we could ASSUME P(N,M|not_S) = 0. Again, there is no compelling reason for this assumption, but making it simplifies things.

So, let's assume P(N,M|not_S) = 0. We then get P(N,M) = P(N,M|S)*P(S), hence, finally: P(S|N,M) = P(N|S)*P(M|S)*P(S)/[P(N|S)*P(M|S)*P(S)] = 1.

Note: if you assume, instead, that P(N,M|not_S) = q (for some 0 < q < 1), we have P(N,M) = (1/2)(1/3)(1/3) + q*(1/2) = 1/18 + q/2, which would give P(S|N,M) = (1/18)/[(1/18) + (q/2)] < 1.
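
The dependence on q can be tabulated with a few lines of code. This is a minimal sketch of the calculation above, with a hypothetical function name; q = 1/9 is just an arbitrary illustrative value.

```python
from fractions import Fraction

def posterior_spam(q):
    """P(S|N,M) under the assumptions above, with q = P(N,M|not_S)."""
    p_s = Fraction(1, 2)
    p_nm_given_s = Fraction(1, 3) * Fraction(1, 3)  # conditional independence
    numerator = p_nm_given_s * p_s
    p_nm = numerator + q * (1 - p_s)  # law of total probability
    return numerator / p_nm

print(posterior_spam(Fraction(0)))     # 1
print(posterior_spam(Fraction(1, 9)))  # 1/2
```

Because the numerator also appears as a term of the denominator, the result stays between 0 and 1 for any q in that range.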

RGV

FFMG

May 8, 2012, 2:07:27 PM

Thanks for all the replies. I guess I will force the document classification between 0 and 1, because in my case I will have hundreds of thousands of documents (we currently have over 200,000), and hopefully it will not take more than 5,000 'training' documents to get some meaningful classification.

I just thought that even with my 18 documents I should still get a probability between 0 and 1.

My main task was to write unit tests, and if the correct result in my test with 18 documents is a probability of '2' then I guess the calculations are valid.

Thanks for all inputs and suggestions.

FFMG

Jussi Piitulainen

May 9, 2012, 12:14:42 AM

Look again at Ray Vickson's post. I think he hit the nail on the head.

He used the law of total probability to expand the denominator as

P("naughty", "money")
= P("naughty", "money" | "spam" or "not spam")
= P("naughty", "money" | "spam") P("spam")
+ P("naughty", "money" | "not spam") P("not spam")

after which you can use the _same_ independence assumption in both the
numerator and the denominator. That shouldn't lead to such a blatant
contradiction.

Perhaps this is how Naive Bayes is always done. I haven't checked.
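
Written out in Ray Vickson's notation (S = spam, N = naughty, M = money), applying the conditional independence assumption to both the numerator and the expanded denominator would give something like the following. This is a sketch of that one step, not a statement of how any particular implementation does it.

```latex
P(S \mid N, M) =
  \frac{P(N \mid S)\,P(M \mid S)\,P(S)}
       {P(N \mid S)\,P(M \mid S)\,P(S) + P(N \mid \lnot S)\,P(M \mid \lnot S)\,P(\lnot S)}
```

The numerator is one of the two non-negative terms in the denominator, so as long as the denominator is nonzero the ratio cannot fall outside 0 and 1.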