Message from discussion Probabilities always >= 0 and <= 1?
Newsgroups: sci.math
From: Jussi Piitulainen <jpiit...@ling.helsinki.fi>
Date: 08 May 2012 19:14:24 +0300
Local: Tues, May 8 2012 12:14 pm
Subject: Re: Probabilities always >= 0 and <= 1?

Gus Gassmann writes:
> On May 8, 12:00 pm, FFMG wrote:
> > On Tuesday, 8 May 2012 16:41:18 UTC+2, Jussi Piitulainen  wrote:
> > > FFMG writes:

> > > > > You have made an incorrect independence assumption. As both
> > > > > "naughty" and "money" are only present in "spam" documents,
> > > > > which form half of the total number of documents, they are
> > > > > dependent variables. But, you calculate p(e) as p(money) *
> > > > > p(naughty) which is assuming that the variables are
> > > > > independent. Hence your problem.

> > > > Sorry, that's not an assumption, that's the way the problem
> > > > definition goes, the words "naughty" and "money" are indeed only
> > > > present in "spam".

> > > > And they are independent variables, the presence of "naughty"
> > > > is not dependent on "money", (and vice versa).

> > > > The formula is
> > > > P(C|F1...Fn) = P(C) P(F1|C)...P(Fn|C) / (P(F1)...P(Fn))

> > > > So, given the problem in my original post, the result is not
> > > > between 0 and 1.

> > > Probability theory only gives you
> > > P(C | F1...Fn) = P(C) P(F1...Fn | C) / P(F1...Fn).

> > > Then come the independence assumptions which allow you to expand
> > > P(F1...Fn | C) as P(F1 | C)...P(Fn | C) and P(F1...Fn)
> > > similarly.  These give Naive Bayes its first name.

> > > If "naughty" and "money" were exactly independent and
> > > probabilities exactly relative frequencies in your document
> > > collection, there should be half a document that contains them
> > > both. Half a document does not quite make sense, but there's
> > > worse: if "naughty" and "money" were exactly independent given
> > > "spam", there should be _one_ document that contains both
> > > "naughty" and "money" (and is classified as "spam").

> > > Since we don't want to accept 1/2 = 1 and we think that relative
> > > frequencies do have the formal properties of probabilities, we
> > > blame the independence assumptions. I suppose they would be
> > > closer to the truth much of the time in a larger
> > > population.

> > So, if I understand you correctly the 2 issues at hand are:
> > 1) I don't have enough documents and classified words, (or at
> > least the more I have the more likely I will get to between 0 and
> > 1).
> > 2) The Naive Bayes formula will not guarantee a number between 0
> > and 1 only.

> Neither. There is no requirement on the number of documents, and the
> Bayes formula works. However, your original stab was certainly
> false, since it contained "money" twice and did not contain
> "naughty" at all.  As such it is difficult even to figure out what
> you were trying to compute.

The double "money" was an obvious typo. It should have been "naughty"
and "money".

The independence assumptions are an essential part of the Naive Bayes
method, the naive part. They are known to be strictly false but they
simplify things and the method still seems to work in practice.
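
To see how the independence assumptions can fail on a small collection,
here is a minimal Python sketch. The four-document corpus is
hypothetical (the original data is not quoted in this message); it is
chosen only so that "naughty" and "money" occur solely in "spam" and
spam makes up half of the documents. The label "ham" for "not spam" and
the helper names are likewise assumptions.

```python
from fractions import Fraction

# Hypothetical corpus (an assumption, not the original poster's data):
# each document is (label, set of words).
DOCS = [
    ("spam", {"naughty", "money"}),
    ("spam", {"money"}),
    ("ham", {"hello"}),
    ("ham", {"meeting"}),
]

def prob(event, given=lambda doc: True):
    """Relative frequency of `event` among the documents satisfying `given`."""
    pool = [doc for doc in DOCS if given(doc)]
    return Fraction(sum(1 for doc in pool if event(doc)), len(pool))

is_spam = lambda doc: doc[0] == "spam"
has = lambda word: (lambda doc: word in doc[1])
both = lambda doc: {"naughty", "money"} <= doc[1]

# Formula with the denominator expanded as P(F1)...P(Fn).
naive = (prob(is_spam)
         * prob(has("naughty"), is_spam) * prob(has("money"), is_spam)
         / (prob(has("naughty")) * prob(has("money"))))

# Bayes' theorem with the joint probabilities, no independence assumed.
exact = prob(is_spam) * prob(both, is_spam) / prob(both)

print(naive)  # 2
print(exact)  # 1
```

In this corpus P(naughty) * P(money) = 1/8 while P(naughty AND money) =
1/4, so the two words are not independent, and the expanded denominator
is what pushes the result above 1.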

I don't have any personal experience with such methods, however. My
guess about more data making the assumptions "less false" in practice
is just a guess.

> > So, as the formula seems to be correct in my example, I guess my
> > question would be, is there any way of bringing the number back
> > between 0 and 1? or can I simply assume that anything > 1 is in
> > fact 1, (or almost 1).

> As Jussi explained, you have to use the data correctly, and you got
> to think a little. What is Prob(spam|naughty)?  What is
> Prob(spam|money)?  What then do you think of Prob(spam|naughty AND
> money)? Hint, in this case you do not even need Bayes.

That wouldn't be Naive. I don't see what to do without Bayes either,
is that just me? And what if there are no instances of "naughty" AND
"money" when assigning the probabilities?
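
One possible reading of the hint, sketched in Python under the
assumption that probabilities are plain relative frequencies: count
directly among the documents that contain both words. The corpora and
the function name `direct_estimate` are hypothetical. The second corpus
shows the case asked about here, where no document contains both words
and the direct estimate is undefined.

```python
from fractions import Fraction

def direct_estimate(docs, words, label="spam"):
    """P(label | all words present) by counting; None if no document matches."""
    matching = [lab for lab, text in docs if words <= text]
    if not matching:
        return None  # 0/0: no document contains all the words
    return Fraction(matching.count(label), len(matching))

# Hypothetical corpora (assumptions for illustration only).
with_overlap = [
    ("spam", {"naughty", "money"}),
    ("spam", {"money"}),
    ("ham", {"hello"}),
    ("ham", {"meeting"}),
]
without_overlap = [
    ("spam", {"naughty"}),
    ("spam", {"money"}),
    ("ham", {"hello"}),
    ("ham", {"meeting"}),
]

query = {"naughty", "money"}
print(direct_estimate(with_overlap, query))     # 1
print(direct_estimate(without_overlap, query))  # None
```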

> > Following on to that, I also see many examples where the
> > denominator can be ignored as it can be regarded as constant. But
> > then how can I calculate how close the probability of a number is
> > to 1? (because without a denominator I have no idea how close the
> > probability of a document is to be 'spam').

> You'll have to explain better what you mean by this. As is, it makes
> no sense to me.

The denominator does not matter when one is comparing alternatives
that have the same denominator. When one says that the posterior is
proportional to the prior and the likelihood, one thinks of that
denominator as an uninteresting proportionality constant.

Perhaps it's so that an actual probability P("spam" | data) alone
would be more or less meaningful, but with values that are only
proportional to probabilities one would need such values for both
"spam" and "not spam" to convey the same information.
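
As a sketch of that last point: if the proportional values are computed
for both "spam" and "not spam", dividing each by their sum gives numbers
that lie between 0 and 1 and add up to 1. The corpus below is
hypothetical and deliberately differs from the thread's example in one
respect: one "ham" (not spam) document also contains "money", so that
the result is not trivially 1. The helper names are assumptions too.

```python
from fractions import Fraction

# Hypothetical corpus (an assumption for illustration only).
DOCS = [
    ("spam", {"naughty", "money"}),
    ("spam", {"money"}),
    ("ham", {"hello", "money"}),
    ("ham", {"meeting"}),
]

def score(label, words):
    """P(C) * P(F1|C) * ... * P(Fn|C): proportional to the posterior."""
    in_class = [text for lab, text in DOCS if lab == label]
    result = Fraction(len(in_class), len(DOCS))
    for word in words:
        result *= Fraction(sum(1 for text in in_class if word in text),
                           len(in_class))
    return result

def posteriors(words, labels=("spam", "ham")):
    """Divide each class score by the sum of the scores over all classes."""
    scores = {label: score(label, words) for label in labels}
    total = sum(scores.values())
    if total == 0:
        return None  # every class scored 0; nothing to normalize
    return {label: s / total for label, s in scores.items()}

for label, p in posteriors({"money"}).items():
    print(label, p)
# spam 2/3
# ham 1/3
```

Under the Naive Bayes assumptions the sum over classes plays the role of
the denominator P(F1...Fn), so that denominator never has to be
estimated as a product P(F1)...P(Fn).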


 