random variables, ; and | notation


Bob Carpenter

unread,
Aug 22, 2016, 6:07:16 PM8/22/16
to stan...@googlegroups.com
[off issue and on dev list]

I'm trying to stop the GitHub issues from straying off into
long philosophical/mathematical discussions.

My understanding (pending revision, as always) is that the frequentists
insist on p(y ; theta) specifically because they don't treat theta as a
random variable; they are perfectly happy to write p(y | theta)
when both y and theta are random variables. Actually, the careful ones would
write

p_{Y | Theta}(y | theta)

when Y and Theta are random variables --- y and theta are just
arbitrary locally scoped variables written with the same sloppy
notation for binding as mathematicians use everywhere else when
talking about functions (boy, were my eyes ever opened about that
dx notation in calc when I learned Mathematica and saw it spelled
out in lambda calculus). Many Bayesians are even sloppier (or more
concise if you want to put a more positive spin on it),
writing p(y | theta) and leaving the actual random
variables Y and Theta unspoken, or sometimes overloaded and written
as y and theta. That's why it's so hard to write the usual CDF

F_Y(y) = Pr[Y <= y]

in BDA-style notation. I found this all very confusing when first trying
to understand stats because the intro math stats books never precisely
define random variables.

My understanding (also pending revision) is that a (real-valued)
random variable Y is a total function Y : Omega -> R, where Omega
is the sample space and R is the set of real numbers. I don't
see how context of use changes the notion of what a random variable
is.
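
Here's the kind of thing I mean, as a toy Python sketch (the
two-coin sample space and its measure are invented purely for
illustration; nothing Stan-specific):

Omega = ["HH", "HT", "TH", "TT"]            # sample space: two coin flips
P = {omega: 0.25 for omega in Omega}        # probability measure on Omega

def Y(omega):
    """A random variable: a total function Y : Omega -> R (counts heads)."""
    return float(omega.count("H"))

def cdf(y):
    """F_Y(y) = Pr[Y <= y] = P({omega in Omega : Y(omega) <= y})."""
    return sum(P[omega] for omega in Omega if Y(omega) <= y)

print(cdf(0.0), cdf(1.0), cdf(2.0))         # 0.25 0.75 1.0

Note that the y in cdf is exactly one of those locally scoped bound
variables --- the random variable is the function Y itself.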

- Bob

> On Aug 22, 2016, at 8:50 PM, Michael Betancourt <notifi...@github.com> wrote:
>
> That’s not quite the argument.
>
> The semicolon is purely frequentist and is meant to avoid
> interpreting the likelihood as a conditional probability distribution,
> and I agree that this has no place in Bayesian inference (in some
> sense the difference between | and ; distills the fundamental
> differences between frequentist and Bayesian inference).
>
> Instead the difference is in the definition and use of a random
> variable. When using a conditional probability distribution, any
> variables to the left of the | are having their distribution _defined_
> by the conditional distribution. Any variables to the right of the
> |, however, are just queried by value, independent of their
> distribution. Sequential generative modeling with conditional
> probability distributions is just a way to isolate how a random
> variable is defined and how it is used to influence other random
> variables.
>
> In other words, the interpretation of what is a random variable
> and what isn’t is being locally scoped just to the definition of
> that conditional probability distribution. I understand how this
> would be confusing to users not familiar with probability, and
> even to those who are, as the error messages would be
> referring to this very local scope and not the global scope of
> the program, where everything is a random variable. It was just
> a suggestion. Ultimately anything we choose is going to
> be ambiguous at some level, inconsistent with the actual math,
> or confusing to most users.
>
> On Aug 22, 2016, at 11:03 AM, Bob Carpenter <notifi...@github.com> wrote:
>
> > Understood, but it's the "anymore" that's problematic. They are
> > random variables from the Stan program's perspective. And if you
> > talk to Andrew (or read BDA), then everything's a random variable
> > in Bayesian stats, even the predictors in a regression that come in
> > as data and get no distribution. That's why he refuses to let us use
> > the semicolon notation (the frequentists use that precisely to distinguish
> > the random variables from other variables). Going against BDA, even with
> > technical correctness on our side, seems like a mug's game --- it'll just
> > confuse users and annoy Andrew without much upside.
> >

Andrew Gelman

unread,
Aug 22, 2016, 6:15:42 PM8/22/16
to stan...@googlegroups.com
I've blogged on this . . .

Short answer is that, from a Bayesian perspective (which fits Stan pretty well!), the expression p(a | b) is clearly defined, _without_ b needing to be a random variable. p(a | b) is defined as a nonnegative function for which \int p(a | b) da = 1 for any b. In addition, _if_ b is a random variable, then p(a,b) = p(a | b) * p(b) and all the rest.

To put it another way, old-style probability textbooks define p(a | b) as p(a,b)/p(b). But it's just as good (indeed, I'd argue, better) to define p(a,b) as p(a | b) * p(b).
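
A minimal discrete sketch of that direction of definition, with made-up
numbers (build the joint from the conditional and the marginal, then
recover the textbook ratio by division):

import numpy as np

p_b = np.array([0.3, 0.7])              # marginal p(b) over two values of b
p_a_given_b = np.array([[0.2, 0.8],     # row 0: p(a | b=0), sums to 1
                        [0.5, 0.5]])    # row 1: p(a | b=1), sums to 1

joint = p_a_given_b * p_b[:, None]      # define p(a, b) = p(a | b) * p(b)
assert np.isclose(joint.sum(), 1.0)     # it is a proper joint distribution

# The old-style definition p(a | b) = p(a, b) / p(b) is then recovered:
assert np.allclose(joint / joint.sum(axis=1, keepdims=True), p_a_given_b)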

A


Bob Carpenter

unread,
Aug 22, 2016, 6:42:14 PM8/22/16
to stan...@googlegroups.com

> On Aug 23, 2016, at 12:15 AM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
>
> I've blogged on this . . .

And tried to explain it to me in person and point out a
relevant discussion in BDA (p. 354 in the 3rd edition). In BDA
you only say that a full Bayesian model includes a distribution
for X (regression predictors) and then say that in regression
models, p(X | psi) is taken to be independent of
p(y | X, theta), so that you can ignore the p(X | psi) bit when
looking at the posterior p(theta | X, y). And you say that
X and y are both 'data' (your single quotes). But data just
means you've observed their value conditioned on the "true"
member of the sample space, I think.
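
Writing out the factorization I take you to mean (my reconstruction,
assuming theta and psi are a priori independent):

p(theta, psi | X, y) \propto p(y | X, theta) p(X | psi) p(theta) p(psi)

and since p(X | psi) p(psi) carries no theta, it integrates out to a
constant and drops out of

p(theta | X, y) \propto p(y | X, theta) p(theta).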

> Short answer is that, from a Bayesian perspective (which fits Stan pretty well!), the expression p(a | b) is clearly defined, _without_ b needing to be a random variable. p(a | b) is defined as a nonnegative function for which \int p(a | b) da = 1 for any b.

But don't all your functions named p(...|...) have to be consistent
with each other?

> In addition, _if_ b is a random variable, then p(a,b) = p(a | b) * p(b) and all the rest.
>
> To put it another way, old-style probability textbooks define p(a | b) as p(a,b)/p(b). But it's just as good (indeed, I'd argue, better) to define p(a,b) as p(a | b) * p(b).

Either way, it's confusing to those of us trying to learn this stuff
that the definition of what seems like the same concept *seems* (note
the qualification) to change from place to place.

- Bob

Michael Betancourt

unread,
Aug 22, 2016, 6:48:53 PM8/22/16
to stan...@googlegroups.com
>
> Short answer is that, from a Bayesian perspective (which fits Stan pretty well!), the expression p(a | b) is clearly defined, _without_ b needing to be a random variable. p(a | b) is defined as a nonnegative function for which \int p(a | b) da = 1 for any b. In addition, _if_ b is a random variable, then p(a,b) = p(a | b) * p(b) and all the rest.

Right — the interpretation of b depends on the context.

If you look at the conditional probability distribution alone, it is mostly just
an index defining exactly which distribution over a you are considering
(hence it could also be written as p_b(a), a form that is not uncommon
in the frequentist literature). Formally there are a few additional properties
that have to be satisfied.

But if you look at the entire model, then b might also be treated as
random and given its own distribution, p(b), which then defines the
joint distribution

p(a, b) = p(a | b) p(b).

But it could also just be a hyperparameter that is assumed certain.

The big difference between the semicolon and the bar is that in the semicolon
notation anything to the right of the semicolon _cannot be random ever_. But
in the bar notation anything to the right of the bar _could be random_.
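
Spelled out side by side (my shorthand):

frequentist:  { p(y ; theta) : theta in Theta }
              --- theta only ever indexes the family, never random

Bayesian:     p(y | theta), optionally with a marginal p(theta),
              giving p(y, theta) = p(y | theta) p(theta)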

> To put it another way, old-style probability textbooks define p(a | b) as p(a,b)/p(b). But it's just as good (indeed, I'd argue, better) to define p(a,b) as p(a | b) * p(b).

Interestingly enough, the definition p(a | b) = p(a, b) / p(b) works only in
discrete spaces. A huge amount of measure theory went into developing
how to define p(a | b) when b itself has measure zero. The result is a mess
(disintegrations do sound cool, though) but ultimately everything ends up
being defined in terms of expressions like

\int f(a, b) p(a, b) da db = \int f(a, b) p(a | b) p(b) da db.

So in a very real sense the product form _is_ more fundamental.
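
A quick Monte Carlo sanity check of that identity, with a made-up toy
model and test function (nothing from the actual issue):

import numpy as np

# Check E[f(a, b)] via ancestral sampling: b ~ p(b), then a ~ p(a | b).
# Toy model: b ~ normal(0, 1), a | b ~ normal(b, 1), f(a, b) = a * b.
rng = np.random.default_rng(42)
n = 1_000_000

b = rng.normal(0.0, 1.0, size=n)   # draws from p(b)
a = rng.normal(b, 1.0)             # draws from p(a | b), one per b

print((a * b).mean())              # ~1.0
# Analytically E[a b] = E[b E[a | b]] = E[b^2] = 1, so sampling through
# the factorization p(a | b) p(b) matches the expectation under p(a, b).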

Bob Carpenter

unread,
Aug 22, 2016, 6:56:14 PM8/22/16
to stan...@googlegroups.com

> On Aug 23, 2016, at 12:48 AM, Michael Betancourt <betan...@gmail.com> wrote:
>
> ...

>> To put it another way, old-style probability textbooks define p(a | b) as p(a,b)/p(b). But it's just as good (indeed, I'd argue, better) to define p(a,b) as p(a | b) * p(b).
>
> Interestingly enough, the definition p(a | b) = p(a, b) / p(b) works only in
> discrete spaces. A huge amount of measure theory went into developing
> how to define p(a | b) when b itself has measure zero. The result is a mess
> (disintegrations do sound cool, though) but ultimately everything ends up
> being defined in terms of expressions like
>
> \int f(a, b) p(a, b) da db = \int f(a, b) p(a | b) p(b) da db.
>
> So in a very real sense the product form _is_ more fundamental.

OK. I see this in your new(ish) intro to probability theory.
Sounds like Andrew's on board. It would've saved me a whole lot of
grief trying to understand what a conditional meant! Everyone
cheats by starting with the discrete case then handwaving the
continuous one. But given that most textbooks and classes define
p(a | b) in terms of random variables (even if the definitions
don't work), then I'll still stand by my assertion that we
should just call them the first and second arguments to be
neutral yet still precise (if not fully explanatory and if not
quite getting to the bigger point Michael wanted to make with
the | notation, which I really do get).

- Bob

Michael Betancourt

unread,
Aug 22, 2016, 7:05:13 PM8/22/16
to stan...@googlegroups.com
Agreed — and why I noted that there will be no error message that is
easy for beginners to understand and is mathematically correct.  We
could try making up new vocabulary, like “active” variables or some BS
like that, but it would cause more harm than good.  So I would personally
be fine with something more operational as you suggest.

Bob Carpenter

unread,
Aug 22, 2016, 7:09:14 PM8/22/16
to stan...@googlegroups.com
I think "between first and second argument" is mathematically
(or at least computationally/logically) correct, just not
descriptive about what those arguments mean. But that's probably
what you mean by "operational", so I think we're on the same page now.

As always, thanks for all the tutoring! I really appreciate it.

- Bob