Gus Gassmann writes:
The double "money" was an obvious typo. It should have been "naughty"
and "money".
The independence assumptions are an essential part of the Naive Bayes
method, the naive part. They are known to be strictly false but they
simplify things and the method still seems to work in practice.
I don't have any personal experience with such methods, however. My
guess about more data making the assumptions "less false" in practice
is just a guess.
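
To make the naive part concrete, here is a minimal sketch of the score the method computes for each class: the prior times the product of per-word likelihoods, as if the words were independent given the class. All the numbers and names below (p_spam, p_words_given_spam, naive_score and so on) are made up for illustration and are not taken from the original example.

```python
from math import prod  # available in Python 3.8 and later

def naive_score(prior, likelihoods):
    """Unnormalized naive Bayes score: prior * product of P(word | class).

    Multiplying the per-word likelihoods is exactly the independence
    assumption; the result is only proportional to a probability.
    """
    return prior * prod(likelihoods)

# Hypothetical estimates, invented for this illustration only.
p_spam = 0.4
p_words_given_spam = {"naughty": 0.5, "money": 0.6}
p_not_spam = 0.6
p_words_given_not_spam = {"naughty": 0.05, "money": 0.2}

words = ["naughty", "money"]
spam_score = naive_score(p_spam, [p_words_given_spam[w] for w in words])
not_spam_score = naive_score(p_not_spam, [p_words_given_not_spam[w] for w in words])

print(spam_score, not_spam_score)
```

With these invented numbers the "spam" score is the larger of the two, which is all one needs to pick a class.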
> > So, as the formula seems to be correct in my example, I guess my
> > question would be, is there any way of bringing the number back
> > between 0 and 1? Or can I simply assume that anything > 1 is in
> > fact 1 (or almost 1)?
>
> As Jussi explained, you have to use the data correctly, and you got
> to think a little. What is Prob(spam|naughty)? What is
> Prob(spam|money)? What then do you think of Prob(spam|naughty AND
> money)? Hint, in this case you do not even need Bayes.
That wouldn't be Naive. I don't see what to do without Bayes either,
is that just me? And what if there are no instances of "naughty" AND
"money" when assigning the probabilities?
> > Following on to that, I also see many examples where the
> > denominator can be ignored as it can be regarded as constant. But
> > then how can I calculate how close the probability of a number is
> > to 1? (because without a denominator I have no idea how close the
> > probability of a document is to be 'spam').
>
> You'll have to explain better what you mean by this. As is, it makes
> no sense to me.
The denominator does not matter when one is comparing alternatives
that have the same denominator. When one says that the posterior is
proportional to the prior and the likelihood, one thinks of that
denominator as an uninteresting proportionality constant.
Perhaps an actual probability P("spam" | data) would be more or less
meaningful on its own, but with values that are only proportional to
probabilities one would need such values for both "spam" and
"not spam" to convey the same information.
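
As a rough sketch of that last point: given unnormalized scores for both alternatives, dividing each by their sum recovers values between 0 and 1, and that sum plays the role of the ignored denominator. The scores and the helper name normalize below are hypothetical, chosen only for illustration.

```python
def normalize(scores):
    """Turn scores proportional to probabilities into actual probabilities.

    The sum of the scores over all alternatives is the shared
    denominator that can be ignored when only comparing them.
    """
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all scores are zero; nothing to normalize")
    return {label: score / total for label, score in scores.items()}

# Hypothetical unnormalized scores (prior * likelihood) for one document.
scores = {"spam": 0.12, "not spam": 0.006}
posterior = normalize(scores)

# The winner is the same before and after normalizing.
best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posterior, key=posterior.get)

print(posterior, best_by_score, best_by_posterior)
```

Normalizing this way cannot give a value above 1, because each score is divided by a total that includes it.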