Message from discussion Probabilities always >= 0 and <= 1?

Newsgroups: sci.math
Date: Tue, 8 May 2012 08:35:51 -0700 (PDT)
Local: Tues, May 8 2012 11:35 am
Subject: Re: Probabilities always >= 0 and <= 1?
On May 8, 12:00 pm, FFMG <spambuc...@myoddweb.com> wrote:

> On Tuesday, 8 May 2012 16:41:18 UTC+2, Jussi Piitulainen  wrote:
> > FFMG writes:

> > > > You have made an incorrect independence assumption. As both
> > > > "naughty" and "money" are only present in "spam" documents, which
> > > > form half of the total number of documents, they are dependent
> > > > variables. But, you calculate p(e) as p(money) * p(naughty) which
> > > > is assuming that the variables are independent. Hence your
> > > > problem.

> > > Sorry, that's not an assumption, that's the way the problem
> > > definition goes, the words "naughty" and "money" are indeed only
> > > present in "spam".

> > > And they are independent variables, the presence of "naughty" is not
> > > dependent on "money", (and vice versa).

> > > The formula is
> > > P(C | F1...Fn) = P(C) P(F1 | C)...P(Fn | C) / (P(F1)...P(Fn))

> > > So, given the problem in my original post, the result is not between
> > > 0 and 1.

> > Probability theory only gives you
> > P(C | F1...Fn) = P(C) P(F1...Fn | C) / P(F1...Fn).

> > Then come the independence assumptions which allow you to expand
> > P(F1...Fn | C) as P(F1 | C)...P(Fn | C) and P(F1...Fn) similarly.
> > These give Naive Bayes its first name.

> > If "naughty" and "money" were exactly independent and probabilities
> > exactly relative frequencies in your document collection, there should
> > be half a document that contains them both. Half a document does not
> > quite make sense, but there's worse: if "naughty" and "money" were
> > exactly independent given "spam", there should be _one_ document that
> > contains both "naughty" and "money" (and is classified as "spam").

> > Since we don't want to accept 1/2 = 1 and we think that relative
> > frequencies do have the formal properties of probabilities, we blame
> > the independence assumptions. I suppose they would be approximately
> > closer to the truth much of the time in a larger population.

> So, if I understand you correctly, the two issues at hand are:
> 1) I don't have enough documents and classified words (or at least, the more I have, the more likely the result is to fall between 0 and 1).
> 2) The Naive Bayes formula does not guarantee a number between 0 and 1.

Neither. There is no requirement on the number of documents, and the
Bayes formula works. However, your original stab was certainly false,
since it contained "money" twice and did not contain "naughty" at all.
As such it is difficult even to figure out what you were trying to
compute.

> So, as the formula seems to be correct in my example, I guess my question would be: is there any way of bringing the number back between 0 and 1? Or can I simply assume that anything > 1 is in fact 1 (or almost 1)?

As Jussi explained, you have to use the data correctly, and you have to
think a little. What is Prob(spam|naughty)?  What is Prob(spam|money)?
What then do you think of Prob(spam|naughty AND money)? Hint: in this
case you do not even need Bayes.
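
To make the point concrete, here is a minimal sketch on a made-up
collection of four documents. The data is an assumption chosen only to
match the description in this thread (both words occur only in spam,
and spam is half of the collection); it is not the data from the
original post. It computes the conditional probabilities by direct
counting, then the exact Bayes expression with joint probabilities,
then the naive expression whose denominator is P(naughty) * P(money).

```python
from fractions import Fraction

# Hypothetical toy collection (an assumption, not the original poster's data):
# each document is a (label, set of words) pair.
docs = [
    ("spam", {"naughty", "money"}),
    ("spam", {"money"}),
    ("ham", {"hello"}),
    ("ham", {"meeting"}),
]


def always(doc):
    """Condition that every document satisfies."""
    return True


def is_spam(doc):
    return doc[0] == "spam"


def has(word):
    """Return a predicate that tests whether a document contains `word`."""
    def predicate(doc):
        return word in doc[1]
    return predicate


def has_both(doc):
    return {"naughty", "money"} <= doc[1]


def prob(event, given=always):
    """Relative frequency of `event` among the documents satisfying `given`."""
    pool = [d for d in docs if given(d)]
    return Fraction(sum(1 for d in pool if event(d)), len(pool))


# Direct counting: no Bayes needed.
p_spam_given_naughty = prob(is_spam, has("naughty"))
p_spam_given_money = prob(is_spam, has("money"))
p_spam_given_both = prob(is_spam, has_both)

# Exact Bayes: P(spam) P(naughty, money | spam) / P(naughty, money).
exact = prob(is_spam) * prob(has_both, is_spam) / prob(has_both)

# Naive version: the denominator is replaced by P(naughty) * P(money),
# which is only valid if the two words are independent overall.
naive = (
    prob(is_spam)
    * prob(has("naughty"), is_spam)
    * prob(has("money"), is_spam)
    / (prob(has("naughty")) * prob(has("money")))
)

print(p_spam_given_naughty, p_spam_given_money, p_spam_given_both)
print("exact:", exact, "naive:", naive)
```

On this toy data the counting and the exact Bayes expression both give
1, while the naive expression gives 2. The only difference is the
denominator: P(naughty, money) is 1/4 here, but P(naughty) * P(money)
is 1/8.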

> Following on from that, I also see many examples where the denominator is ignored because it can be regarded as constant. But then how can I calculate how close the probability is to 1? (Without a denominator I have no idea how likely a document is to be 'spam'.)

You'll have to explain better what you mean by this. As is, it makes no
sense to me.
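
If the question is about the usual practice of dropping the
denominator, one possible reading is the following: the numerator
P(C) P(F1|C)...P(Fn|C) is computed for every class C, and the results
are divided by their sum. The sketch below assumes that reading. The
priors and per-class word probabilities are invented for illustration,
and `posterior` is a hypothetical helper, not part of any library.

```python
from fractions import Fraction

# Hypothetical estimates, assumed only for this illustration.
prior = {"spam": Fraction(1, 2), "ham": Fraction(1, 2)}
likelihood = {
    "spam": {"naughty": Fraction(1, 2), "money": Fraction(1, 1)},
    "ham": {"naughty": Fraction(1, 10), "money": Fraction(1, 5)},
}


def posterior(words):
    """Normalize the per-class scores P(C) * prod P(F_i | C) so they sum to 1."""
    scores = {}
    for label, p in prior.items():
        score = p
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    total = sum(scores.values())  # plays the role of the shared denominator
    return {label: score / total for label, score in scores.items()}


post = posterior(["naughty", "money"])
print(post)
```

Because each score is divided by the sum over all classes, the results
lie between 0 and 1 and add up to 1 by construction, as long as at
least one class has a nonzero score.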