I’ve been lurking here, and have noticed that there’s been a lot of talk about probability. Unfortunately, much of it has been written rather sloppily, so that the actual workings of probability theory have been obscured rather than clarified. It’s clear that a lot of folks understand some basic issues, but they haven’t been written down carefully enough to make the real issues clear.
By way of background, I have been teaching probability and statistics at the undergraduate and graduate level for a number of decades, and although I am now retired from teaching formal classes I am still associated with the statistics program at the University of Vermont and still work with our students and faculty there.
I do not want to get into arguments, name-calling and snide remarks, which I’ve seen all too much of on talk.origins. My reasons for posting this are to clarify issues that I think haven’t been expressed clearly. I will certainly be glad to explain anything I write that isn’t clear, but I refuse to get into name-calling and other nastiness. At my age, I really have little appetite for that kind of thing.
Probability Theory: The Logic Of Science, by the late E. T. Jaynes, is probably the best free source readily available for what I’m going to talk about. The first few chapters of the book describe in detail the basics of probability theory and how, in the Bayesian context that I will adopt here, these principles can be used in practice. I highly recommend this source. The completed chapters of Prof. Jaynes’ book, which was still in progress when he passed away, can be downloaded here:
https://home.fnal.gov/~paterno/probability/jaynesbook.html
To explain the problems I’ve been having with the discussion here, I want to describe how I used the first day of class in a course on Bayesian Inference and Decision Theory, which I taught many times at two universities. It was an honors college course for freshmen or sophomores (depending on the university) and was taught to a group of very intelligent students whose majors ran the gamut: some in the sciences, a number of pre-med students, liberal arts majors, and once even a dance major. It was taught as a seminar with the students and myself seated around a group of tables arranged in a square so that everyone could see everyone else. Typically the class would have 15-20 students.
On the first day of class (after everyone had introduced themselves) I would conduct the following experiment: I would tell them that I had a coin in my pocket and that I would in a moment take it out of my pocket and toss it. I asked them what the probability was (in their opinion) that the coin would come up heads. Universally they would answer, correctly, that it was 50%.
Then I would reach into my pocket, pull out the coin, toss it so that it landed on the floor and quickly, before anyone (including myself) saw how it came up, put my foot on it. I then asked what the probability was that it came up heads. The students would universally say, 50%.
I then would uncover the coin briefly and out of sight of the students note how it came up. I then would put my foot on it again, and say, “I know how the coin has fallen. What do you say the probability is that it is heads?” This question usually produced a difference of opinion amongst the students. Most still said it was 50%, but some would say that, since someone (me) knew how it actually came up, probability no longer applied, and would say that the coin had come up either heads or tails, but they just didn’t know which.
Actually, either answer is a reasonable one, although they reflect different interpretations of probability:
https://en.wikipedia.org/wiki/Probability_interpretations
The first one, that it was still 50%, reflects a Bayesian point of view that probability is a way of describing one’s personal (subjective) uncertainty as to the truth of a proposition (in this case, the student’s belief that the coin shows heads) given what one knows:
https://en.wikipedia.org/wiki/Bayesian_inference
The second one, that it was either heads or tails but that probability no longer applies, reflects a frequentist point of view in which probability statements apply to a sequence of identical events (in this case, a lot of coin tosses) but that once a given event has been instantiated, it no longer makes sense to talk about the probability of that particular event:
https://en.wikipedia.org/wiki/Frequentist_inference
Now, I’m not certain of this but I’m guessing that it may be that Dr. Kleinman’s point of view is basically frequentist, in which case it may have something to do with the differences he’s been having with others on the list. Perhaps he can elaborate (or tell me that my guess is wrong). But there are problems with how others here have been explaining their point of view, which I’ll get to in a bit. In any case, if there were students in my class that took the point of view that probability no longer applies, I would spend some time discussing this as well as explaining the Bayesian point of view, and saying that the Bayesian point of view is the one we would be using during the semester. I would point out to them that, for example, it would still make sense for one student in the class to bet at even odds with another student that the coin came up heads, even though the professor knows how the coin came up, and that the ability to make bets at reasonable odds is part of the Bayesian view of probability.
As an important aside, I want to point out that under the Bayesian view of probability, it is perfectly reasonable to talk about the probability of events that have already happened (or not happened), or even about events that are not in any way the result of a random process like a sequence of identical coin tosses. The Bayesian use of probability, which does use the standard rules of ordinary probability theory, can and does apply to unique events. What it gives us is a way of talking about how certain or uncertain we are about the truth or falsity of those events or facts, given the information we use to make that evaluation.

Examples of the sorts of things we can sensibly discuss using the Bayesian approach would be the probability that a person on trial for murder is in fact the murderer, after we hear the evidence at trial; the probability that a particular physical theory like general relativity is a correct description of nature; or the probability that a particular horse will win the Kentucky Derby. All of these are unique events, some of them already a fact, if unknown to us: the person on trial for murder, for example, knows for sure whether he is innocent or guilty, and of course the laws of physics are what they are. And in the case of the Kentucky Derby, people actually do assess the probability that a particular horse will win a particular race, either by placing a bet (at odds that reflect the probability, in their view, that the horse will win) or by declining to bet (if in their view the odds offered would make the bet unfair to them, given their assessment of the probability of the horse winning).
Even physicists have been known to make and later pay off bets on the truth or falsity of a physical theory, and those bets can be translated into probabilities (e.g., a bet at even odds corresponds to a probability of 0.5 that the proposition is true; at 3:1 odds, depending on which side of the bet you take, it corresponds to a probability of 0.25 or 0.75). It turns out that the correct way to update probabilities on unique propositions like these as new data becomes available is prescribed by standard probability theory, as used by Bayesians.
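The odds-to-probability conversion mentioned above is simple arithmetic; here is a minimal sketch in Python (the function name `prob_from_odds` is mine, just for illustration):

```python
def prob_from_odds(against, in_favor):
    # Probability implied by odds of `against`:`in_favor` against a
    # proposition: e.g. 3:1 against -> 1/4, even odds (1:1) -> 1/2.
    return in_favor / (against + in_favor)

print(prob_from_odds(1, 1))  # even odds: 0.5
print(prob_from_odds(3, 1))  # 3:1 against: 0.25
print(prob_from_odds(1, 3))  # the other side of the same bet: 0.75
```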
Back to my experiment with my students.
Next, I tell the students what I saw when I looked at the coin; let’s say it was heads, and if so I will tell them I saw heads. I then ask them what the probability is that the coin is heads (and here we are talking about their subjective probability, each student’s individual degree of belief). At this point there’s a dilemma: the students realize that I might be lying! So they have to estimate, on the first day of class, the chance that the professor they have just met might be lying, perhaps to make a pedagogical point. (In fact I always tell the truth here, but the students have no way of knowing this.) So the students in general are unwilling to say that the probability is 100% that it’s heads. Generally they give me some credit for truthfulness, but they don’t go all the way. They might typically say 90%.
I then invite a student to look at the coin. I uncover it by moving my foot, the student peeks, and I cover it again. I then ask the student to state what she saw. In almost every case the student would say what I said…and the other students will increase their estimate of the probability, but not necessarily to 100%. One time a student contradicted my statement; a very smart guy, he is now a tenured professor at a major university with a doctoral degree from one of the best statistics programs in the world.
Back to the experiment.
I then invite the students to look at the coin themselves (easy to do in a small seminar class), and everyone then agrees that it came up heads, so their probability that it came up heads (given what they now know) is 100%.
I then ask what the probability is that the coin will come up heads if I toss it again. The students say 50%. I then pass the coin around and ask them to examine it, and they discover that the coin has two heads (or in about half the classes, two tails). Even though it is true, as I told them in the beginning, that the probability of heads was 50%, I did not tell them that the reason for that is that I had two coins in my pocket, one with two heads, the other with two tails, and that I drew one of them at random. So it wasn’t the tossing of the coin that was the random process, it was the random choice of which coin to toss! This then gave me an additional opportunity to discuss the Bayesian point of view that probability is a way of describing your subjective uncertainty about the truth of a proposition, given the evidence you have available at the time you make that probability assessment.
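The point that the randomness lies in the choice of coin, not the toss, can be checked with a small Monte Carlo sketch (my own illustration, not part of the classroom experiment):

```python
import random

def trick_coin_toss(rng):
    # Draw one of two trick coins at random:
    # one with two heads, the other with two tails.
    coin = rng.choice([("H", "H"), ("T", "T")])
    # "Toss" the chosen coin: both faces are identical, so the
    # randomness lies entirely in which coin was drawn.
    return rng.choice(coin)

rng = random.Random(0)
n = 100_000
heads = sum(trick_coin_toss(rng) == "H" for _ in range(n))
print(heads / n)  # close to 0.5
```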
OK, this is a long story, how does it bear on the discussions of probability in this newsgroup?
First, I want to say that all probability statements are conditional on the background information you assume. By this I mean that if you want to make a statement in probability theory, it’s not enough to ask “What’s the probability of X?” You also have to state the background information B that is either known or assumed as part of that probability assessment. As that background information changes, the probability assessments will also change, since for the Bayesian they are the subjective estimates of those probabilities given the background information.
So we should always be making that explicit, and in standard probability theory we do it by writing:
P(X | B) = The probability of X given that the background information B is true.
So let’s see how this goes in terms of the experiment I conducted in my first class of the semester.
Initially, the background information B is that there is a 50% chance the coin will come up Heads:
P(H | B) = 0.5
I then toss the coin and put my foot on it, call that information T (for the coin has been Tossed). The new background information is B1 = T & B (where & means logical ‘and’). But still,
P(H | B1) = 0.5
I then look at the coin and tell the students that I know how it came up. Call this new information K (for the professor Knows; I’ll avoid the symbol P, which already denotes probability). The new background information that the students have is B2 = K & B1. I have different background information, but we aren’t talking about my subjective probability, we are asking about the students’ subjective probability. At this point students who initially said “either heads or tails but probability doesn’t apply” are required to adopt a Bayesian point of view and talk about their subjective uncertainty, which hasn’t changed. So
P(H | B2) = 0.5
I then tell the students that the coin is Heads, and they have to assess the probability that I’m telling the truth. They do it somehow (theoretically they should do this by applying Bayes’ theorem, but since they haven’t learned about that at this point they have to do it seat-of-the-pants). The new information is S, the professor Says that the coin is Heads. The new background information is B3 = S & B2.
P(H | B3) = 0.9 (say).
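The seat-of-the-pants assessment can in fact be checked against Bayes’ theorem under a simple assumed model (my own sketch, not something the students did): suppose each student judges that the professor reports truthfully with probability t and otherwise reports the opposite face. With a 50% prior, the posterior probability of heads works out to exactly t:

```python
def posterior_heads(prior_heads, p_truthful):
    # P(S | H): a truthful professor says "heads" when it is heads.
    p_s_given_h = p_truthful
    # P(S | not-H): a lying professor says "heads" when it is tails.
    p_s_given_not_h = 1 - p_truthful
    numerator = p_s_given_h * prior_heads
    # P(S) by summing over the two hypotheses.
    evidence = numerator + p_s_given_not_h * (1 - prior_heads)
    return numerator / evidence

print(posterior_heads(0.5, 0.9))  # approximately 0.9
```

So a student who says 90% is implicitly crediting the professor with 90% truthfulness, under this simple model.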
There will be a similar situation if one of the students looks at the coin and agrees that it was Heads:
P(H | B4) = 0.95 (say) [Again, not for the student that looked, but for the others in the class]
Finally everyone looks and observes for themselves data H that the coin is heads: B5 = H & B4 so
P(H | B5) = P(H | H & B4) = 1, which follows because for any H and C, P(H | H & C) = 1
OK, so this is how my class experiment went and this is how I analyzed it in terms of Bayesian probability. It also illustrates my beef with some folks who have been saying things like “once we observe that X is true, the probability that X is true is 1, it’s no longer 0.5, because conditional, blah blah blah.” The problem is that a statement like this is ambiguous, because it doesn’t make proper use (or really any use) of the notation that’s been developed to make it clear exactly what you are saying when you make a statement using conditional probability, what is conditional on what. And the reason for that is that many people have just not been clear about background information they are assuming when they’ve been making these statements.
In particular, in my experiment, P(H | B) = 0.5, even after the student has looked at the coin and determined that it came up heads. The probability P(H | B) has as its background information only that the coin, when flipped, has probability 0.5 of coming up heads. After you have looked at the coin and determined that it came up heads, P(H | B5) = 1 for sure, but that is not the same as P(H | B), which is still 0.5 and is not changed by anything that happened subsequently, because the background information B ≠ B5.
Proper use of conditioning on background information makes this very clear. But the wordy statements that have been made here haven’t been clear and that may be why there is confusion on this point.
To illustrate how important this is, I want to mention a case where failure to take careful account of precisely this issue led to incorrect conclusions published in the literature. A paper was published a number of years ago that demonstrates how careless use of probability theory, in particular its failure to condition correctly on known data, can lead to faulty conclusions. The argument the author made was that the Bayesian machinery does not allow one to use “old data”, that is, data that you already knew, in order to use Bayes’ theorem to update a prior to get an updated posterior. But because of the need for careful conditioning, the argument given and the conclusion were wrong.
Reminder: Bayes’ theorem is the basic tool used in Bayesian inference. It derives straightforwardly from the multiplication rule for probabilities (here X will be data, H a hypothesis under test, and B the background information):
P(H & X | B) = P(H | X & B)P(X | B) = P(X | H & B)P(H | B) [Multiplication rule used twice here.]
Note that the multiplication rule does not say that P(X & H | B) = P(X | B)P(H | B). This is only valid if X and H are independent under B, which is in general not the case.
Now, by dividing this equation through by P(X | B) (assuming this is not zero) we get Bayes’ theorem:
P(H | X & B) = P(X | H & B)P(H | B)/P(X | B).
Here I have explicitly noted the background information B. Leaving B out can be dangerous, as we will see below.
The divisor, P(X | B), is correctly computed by summation over the available hypotheses, which I will write for simplicity here as H and ¬H (not-H):
P(X | B) = P(X & H | B) + P(X & ¬H | B) = P(X | H & B)P(H | B) + P(X | ¬H & B)P(¬H | B)
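As a concrete sketch, Bayes’ theorem together with the summation formula for P(X | B) amounts to a few lines of Python (the function name and the numbers are hypothetical, purely for illustration):

```python
def bayes_posterior(prior_h, lik_h, lik_not_h):
    # P(X | B), computed by summing over the hypotheses H and not-H:
    # P(X | B) = P(X | H & B)P(H | B) + P(X | ~H & B)P(~H | B)
    p_x = lik_h * prior_h + lik_not_h * (1 - prior_h)
    # Bayes' theorem: P(H | X & B) = P(X | H & B) P(H | B) / P(X | B).
    return lik_h * prior_h / p_x

# Illustrative numbers: prior P(H | B) = 0.3,
# likelihoods P(X | H & B) = 0.8 and P(X | ~H & B) = 0.2.
print(bayes_posterior(0.3, 0.8, 0.2))  # about 0.63
```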
OK, with this background, the author’s argument went this way: since the data X are known and “old”, it follows (according to the author, who left out B) that
P(X) = 1 (??)
Note the failure to specify the background information that X is known, which is why I flag this equation as questionable. Normally one would calculate P(X) by summing over all hypotheses, using the formula I gave just above for P(X | B). One would never just assume the value of P(X) as this argument does.
It therefore follows from this suspect assumption and standard probability theory that P(X | H) = 1 for any H.
Let P(H) be the prior on hypothesis H. Then Bayes’ theorem would state, under these questionable assumptions, that the posterior probability on H, given X, is
P(H | X) = P(X | H)P(H)/P(X) = P(H)
That is, according to this argument if you plug the data into Bayes’ theorem, the posterior is equal to the prior and you haven’t learned anything.
This is clearly wrong and stems from the incorrect notion that P(X) = 1 if X is known data. The correct way to write this is by explicitly conditioning on the known data as part of the background information, that is, P(X | X & B) = 1, where for pedantic completeness I’ve included B, the additional background information that is independent of X. Then you get the following (correct) proof of something (but what?):
P(X | X & B) = 1
And from Bayes’ theorem (where, crucially, you have to condition the prior probability of H on X to write Bayes’ rule correctly):
P(H | X & X & B) = P(X | H & X & B)P(H | X & B)/P(X | X & B) = P(H | X & B)
and you haven’t learned anything.
But this isn’t a proof that you can’t learn from old data, it’s a proof that you can’t use the same piece of data X twice, because to write Bayes’ rule correctly, the prior on H has to be P(H | X & B), that is, it is the probability of H given that X is true, so your second attempt to use the data X doesn’t change the posterior probability of H. [Of course, this also follows trivially because logically X & X = X.] You can still determine P(H | X & B) by applying Bayes’ rule with everything on the right hand side unconditioned on X:
P(H | X & B) = P(X | H & B)P(H | B)/P(X | B)
where it is no longer incorrectly assumed that P(X)=1 [or P(X | B)=1], so that you can do the calculation without a problem arising due to the suspect assumption that P(X)=1 for old data. Instead, one would use the summation formula above by summing over all hypotheses to evaluate P(X | B).
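The “can’t use the same data twice” point can be seen numerically in a short sketch (names and numbers are mine, for illustration only): update once with X, then attempt a second update with the same X, whose likelihoods are now conditioned on X and therefore equal to 1.

```python
def bayes_update(prior_h, lik_h, lik_not_h):
    # P(H | X & B) = P(X | H & B) P(H | B) / P(X | B),
    # with P(X | B) obtained by summing over H and not-H.
    p_x = lik_h * prior_h + lik_not_h * (1 - prior_h)
    return lik_h * prior_h / p_x

# First use of the data X: likelihoods not yet conditioned on X
# (hypothetical numbers, for illustration only).
post = bayes_update(0.3, 0.8, 0.2)

# Second "use" of the same X: once X is part of the background
# information, P(X | H & X & B) = P(X | ~H & X & B) = 1,
# so the update leaves the posterior unchanged.
post_again = bayes_update(post, 1.0, 1.0)
assert abs(post_again - post) < 1e-12  # posterior unchanged
```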
The thing that worries me about some of the things that have been written here is that it looks like some people are making the same mistake that we see in this paper: That once data X have been observed, the probability of X is 1. But that is just failure to use conditioning carefully and properly. The correct statement is that whether or not data X have been observed, the probability of X given X is 1. That is entirely different. That’s just P(X | X) = 1, which is the probability calculus version of the logical tautology that for any X, X implies X (in logic notation, X → X) i.e., “If X is true, then X is true”. As the Jaynes book makes clear, from the point of view of probability theory that he describes, probability theory is (up to an isomorphism) the unique extension of standard logic to the case where truth values can take on any value on the interval [0,1]. But read the first few chapters of the book, which explains it very well.
The correct statement is that P(X | X) = 1, but that P(X) (unconditioned on X) can be anything, it is whatever you would think that probability to be without using the information that X is true. And when you use Bayes’ theorem, you should always calculate P(X) by summation over all hypotheses as described above.
Bill Jefferys