Bayes Probability Estimator


Lucas Adams

Nov 23, 2015, 5:06:48 PM
to duke
Hi there,

I am trying to fully understand the logic of the Bayes estimator. I've read Graham's post on the subject and the math article it is based on, and it seems to me that the assumptions being made (more than just conditional independence) aren't necessarily valid in this case. This is how I see it:

Say we have two fields to compare, A and B. The actual values we have received as similarity scores from the comparators are a and b respectively. We have set the probability values for these scores manually (via the high/low thresholds in the configuration), i.e. we know P(M=1 | A=a) and P(M=1 | B=b), where M is the random variable denoting a match (M=1) or non-match (M=0). It is important to note that what we specify is exactly this quantity; let's call these values pa and pb.
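
For concreteness, here is a minimal Python sketch of what I mean (not Duke's actual code; the linear interpolation is just a stand-in, since the exact score-to-probability mapping doesn't matter for the argument):

def field_probability(score, low, high):
    """Turn a comparator similarity score in [0, 1] into P(M=1 | field).

    'low' and 'high' play the role of the configured probabilities for a
    clear non-match and a clear match. The linear interpolation is only
    illustrative; the point is that the user ends up fixing
    P(M=1 | A=a) directly, not P(A=a | M=1).
    """
    return low + score * (high - low)

pa = field_probability(0.95, low=0.30, high=0.90)  # e.g. a name comparator
pb = field_probability(0.20, low=0.40, high=0.80)  # e.g. an address comparator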

OK, now let's do some probability to derive the update equation described by Graham and used in Duke. What we want is P(M=1 | A=a, B=b), i.e. the probability of a match given knowledge of both comparison values.

P(M=1 | A=a, B=b) = P(M=1, A=a, B=b) / P(A=a, B=b)
P(M=1 | A=a, B=b) = P(A=a, B=b | M=1) * P(M=1) / P(A=a, B=b)    (Bayes' rule)
P(M=1 | A=a, B=b) = P(A=a, B=b | M=1) * P(M=1) / [ P(A=a, B=b | M=1) * P(M=1) + P(A=a, B=b | M=0) * P(M=0) ]    (law of total probability)

Up to this point we have only used basic probability rules and haven't made any simplifying assumptions. Now we will assume (Assumption 1) that A and B are conditionally independent given M.

P(M=1 | A=a, B=b) = P(A=a | M=1) * P(B=b | M=1) * P(M=1) / [ P(A=a | M=1) * P(B=b | M=1) * P(M=1) + P(A=a | M=0) * P(B=b | M=0) * P(M=0) ]

Now let's assume (Assumption 2) that the prior probability of a match is 0.5, i.e. a match is just as likely as a non-match given no further information. This cancels all of the P(M) terms.
P(M=1 | A=a, B=b) = P(A=a | M=1) * P(B=b | M=1) / [ P(A=a | M=1) * P(B=b | M=1) + P(A=a | M=0) * P(B=b | M=0) ]

This now looks similar to the update that is actually used, but notice that we have the values for P(M=1 | A=a), not P(A=a | M=1). In the math notes that I think Graham is referencing (the link he posts is broken, but looking around the same math website you can find an article on combining probabilities that seems to be the right one), the example is of two people (A and B) who are each trying to predict an event (E=1 or E=0). Each person has a history of predicting this event with accuracy a and b respectively. In that case P(E=1 | A=a) does in fact equal P(A=a | E=1), by symmetry: if person A predicts E=1 and has a 0.75 historical accuracy, then P(E=1 | A predicted 1) = 0.75, and if we flip it around and ask, given E=1, what is the probability that A predicted 1, then P(A predicted 1 | E=1) is also 0.75. But this only holds because the probability of guessing correctly is assumed to be independent of the actual answer; in general P(M=1 | A=a) does not equal P(A=a | M=1). Let's assume (Assumption 3) that we can nevertheless substitute these values in.
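
As a quick aside: one easy way to see that this equality is not automatic is to give the event a skewed prior. Even with a perfectly symmetric predictor, the two conditional probabilities then disagree (Python, with made-up numbers):

# Made-up numbers, purely for illustration.
p_e = 0.9    # prior P(E=1), deliberately not 0.5
acc = 0.75   # P(A predicts 1 | E=1) = P(A predicts 0 | E=0)

# P(A predicts 1) by the law of total probability
p_a1 = acc * p_e + (1 - acc) * (1 - p_e)   # = 0.7

# Bayes' rule: P(E=1 | A predicts 1)
p_e_given_a1 = acc * p_e / p_a1            # = 0.675 / 0.7

print(p_e_given_a1)  # ~0.964, while P(A predicts 1 | E=1) is still 0.75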
Now we have:

P(M=1 | A=a, B=b) = P(M=1 | A=a) * P(M=1 | B=b) / [ P(M=1 | A=a) * P(M=1 | B=b) + P(M=0 | A=a) * P(M=0 | B=b) ]
P(M=1 | A=a, B=b) = pa * pb / [ pa * pb + P(M=0 | A=a) * P(M=0 | B=b) ]
P(M=1 | A=a, B=b) = pa * pb / [ pa * pb + (1-pa) * (1-pb) ]
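
As a sanity check, here is the two-field update as code, with one worked example (a minimal Python sketch of the formula above, not Duke's implementation):

def combine_two(pa, pb):
    """Combine P(M=1 | A=a) and P(M=1 | B=b) under assumptions 1-3."""
    return (pa * pb) / (pa * pb + (1 - pa) * (1 - pb))

# Worked example: pa = 0.9, pb = 0.6
# 0.9 * 0.6 / (0.9 * 0.6 + 0.1 * 0.4) = 0.54 / 0.58 ~ 0.931
print(combine_two(0.9, 0.6))

Note how two fields that both point towards a match (both above 0.5) push the combined value above either of them on its own.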

This is the equation that Graham uses for spam filtering and that Duke uses for entity resolution. Given that all three of these assumptions are violated in practice, it seems to me that the answers Duke gives should be correlated with the true probability updates, but not equal to them. On the other hand, letting the user specify arbitrary probability values for the score ranges of a given Comparator alleviates the issue introduced by assumption 2: being able to change the probabilities and the thresholds works around the fact that the 0.5 prior isn't accurate. The issues with the naive conditional-independence assumption of naive Bayes (assumption 1) are well known; in practice you can still get fairly good predictions, though they degrade as the features become more dependent. My interest lies in the effect of assumption 3. Does anyone have a good intuition for the effects it could produce, or have I made a mistake somewhere? My feeling is that this assumption reduces the discriminatory power that some features should have over others. Again, this could be heuristically controlled by tuning the various probability parameters in the config. Maybe this is why the parameters seem particularly hard to tune?
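
For completeness, the same derivation goes through for any number of fields, so under the same three assumptions the combined probability for a whole record pair is just the product form below. A minimal Python sketch, not Duke's actual code:

from math import prod

def combine(probs):
    """Naive-Bayes combination of per-field match probabilities.

    probs: one P(M=1 | field value) per compared field. Equivalent to
    applying the two-field update repeatedly, starting from the 0.5
    prior (which cancels out of the formula).
    """
    probs = list(probs)
    num = prod(probs)
    den = prod(1 - p for p in probs)
    return num / (num + den)

# One very confident field (0.95) dominates several mildly negative
# ones (0.45 each), which hints at why the per-field probabilities
# can be hard to tune.
print(combine([0.95, 0.45, 0.45, 0.45]))  # ~0.91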




Lucas Adams

Nov 23, 2015, 5:15:35 PM
to duke
http://cs.wellesley.edu/~anderson/writing/naive-bayes.pdf Here is a PDF that is similar to what I was talking about. It has a short comparison to Graham's equation at the end, though it doesn't point out the P(X=x | Y=y) = P(Y=y | X=x) assumption that Graham makes.