prediction/simulation/causal analysis


Ross Boylan

Dec 26, 2013, 6:07:22 PM
to stan-users
This is a question about interpreting stan models, not estimating them.
I don't think it's really stan-specific, but I thought I'd see if anyone
has suggestions. I am trying to compute effects, defined as the average
change in outcomes. The problem is that for each set of coefficients I
have to draw repeated simulations, and it's taking forever. Is there a
better way?

I want to find the average causal effect of some variable for my sample:
define the effect, delta, to be mean(B | low) - mean(B | high).
B is an outcome; the mean is across members of the sample; low and high
are versions of the data with some variable A manipulated to take on low
or high values (e.g., observed - 1/2 and observed + 1/2).
The distribution of B depends on covariates X through a model with
parameters b. A is one of the covariates.

I want the posterior distribution of delta, or at least the mean and
standard deviation of that distribution.

Given the posterior distribution of b, estimated with stan, one can draw
b(i), the i'th simulated set of coefficients, and compute delta(i) if
the model is straightforward.

If not, one can simulate outcomes B for each set of coefficients, giving
delta(i, j) where i is the i'th set of coefficients and j indexes
simulated outcomes for those coefficients. One can estimate delta(i)
with the mean over j of the delta(i, j).
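
For concreteness, the nested scheme in R looks something like the
sketch below; sim_outcome(), b_draws, X_low, and X_high are made-up
stand-ins for the fitted pieces, not my actual code.

# b_draws: (draws x coef) matrix of posterior draws of b.
# X_low, X_high: copies of the covariate matrix with A manipulated.
# sim_outcome(X, b): stand-in returning one simulated outcome vector.
delta <- numeric(nrow(b_draws))
n_sim <- 400
for (i in seq_len(nrow(b_draws))) {
  b <- b_draws[i, ]
  d_ij <- replicate(n_sim,
    mean(sim_outcome(X_low, b)) - mean(sim_outcome(X_high, b)))
  delta[i] <- mean(d_ij)   # delta(i) = mean over j of delta(i, j)
}
c(mean = mean(delta), sd = sd(delta))   # posterior summary of delta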

There are 2 reasons the direct computation of delta(i) does not seem
feasible for me. The main one is that I am working in a multi-stage
model, in which A->B->C->D, A causes B, B causes C, and so on. There
are also control variables, and earlier variables can have direct as
well as indirect effects on later ones. This means that one set of
coefficients for the A->B model creates a distribution of B's, and each
distribution of B's implies a different set of downstream estimates.
This seems to leave me with numerical integration as the only reasonable
strategy.

The minor problem is that computing the expected outcome for each
individual, given all the covariates and a set of parameter values, is
not always straightforward because the models may have truncation,
censoring, or mixtures.

At the moment B is a count; C is multinomial; and D is binary.

I'm thinking of capturing the predicted means directly when feasible.
This should increase precision, but does not eliminate the need to draw
repeatedly for each parameter set.
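
By "capturing the predicted means directly" I mean something like the
following sketch, which assumes (purely for illustration) that B given
X and b is zero-truncated Poisson with a log link, so the conditional
mean is available in closed form:

# Illustration only: closed-form E[B | X, b] for a zero-truncated
# Poisson stage, usable in place of raw draws of B at this stage.
e_count_trunc <- function(X, b) {
  lambda <- exp(drop(X %*% b))
  lambda / (1 - exp(-lambda))   # E[B | B > 0]
}
# delta(i, j) for this stage then collapses to an exact delta(i), but
# downstream stages still need draws of B as an input.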

Ross Boylan


Andrew Gelman

Dec 27, 2013, 11:18:59 AM
to stan-...@googlegroups.com
Ross:
I think you can just draw one simulation of the outcome for each parameter value, so you can do that in the "predictive quantities" block or whatever it is called, right? But maybe there is something I'm missing here.
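
(In R, the post-hoc equivalent would be something like this sketch;
the fit object, the parameter name b, and the Poisson stage are all
made up for illustration:)

# One simulated outcome vector per posterior draw, after the fit.
post <- rstan::extract(fit)   # 'fit' is a hypothetical stanfit object
n <- nrow(X_low)
delta <- apply(post$b, 1, function(b) {
  mean(rpois(n, exp(drop(X_low %*% b)))) -
    mean(rpois(n, exp(drop(X_high %*% b))))
})
c(mean = mean(delta), sd = sd(delta))
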
Also, I assume you've seen this paper:
http://www.stat.columbia.edu/~gelman/research/published/ape17.pdf
A


Ross Boylan

Dec 27, 2013, 3:22:35 PM
to stan-...@googlegroups.com
On Fri, 2013-12-27 at 17:18 +0100, Andrew Gelman wrote:
> Ross:
> I think you can just draw one simulation of the outcome for each parameter value, so you can do that in the "predictive quantities" block or whatever it is called, right? But maybe there is something I'm missing here.
If all I wanted was an estimate of the mean effect, I think doing one
simulation per parameter set would suffice, but I don't think that gives
me the uncertainty I need.

Concretely, I can simulate draws from a count distribution, and compare
the average counts between the high and low scenarios. But the
variability in the outcome arises partly from variability in the
parameters and partly from the variability induced by the random process
for a given set of parameters (mechanically, the last part corresponds
to the simulation given parameter values). I only want the former.
Since I'm taking means over my sample of individuals (or subsets of
them), I thought the simulation-induced variability would be trivial,
since I'm averaging many simulated values, but some testing showed it
is not.
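
(The check was essentially the following, with made-up objects:
delta_ij holds the sample-mean effect for parameter set i, simulation
run j.)

# Law-of-total-variance check: how much of var(delta) is simulation
# noise rather than parameter uncertainty?
within_var  <- mean(apply(delta_ij, 1, var))   # simulation noise
delta_i     <- rowMeans(delta_ij)
between_var <- var(delta_i) - within_var / ncol(delta_ij)
# If within_var is not small relative to between_var, one draw per
# parameter set is too noisy.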

A more mechanical problem is that I'm estimating each model separately
(e.g., A->B, A+B->C, ...) but I need to combine them all at the end. So
I can't do predictions within a single model estimation.

In a perfect world I would estimate all the models jointly, and thus
avoid assuming the coefficients between models were independent, but
estimating them singly was hard enough.

A final mechanical issue is that the prediction is not based on the data
used to fit the model, but a modified version of that data.
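
So the post-processing chains the stages together in R, roughly as in
this sketch; sim_B(), sim_C(), sim_D(), and the coefficient matrices
bAB, bBC, bCD are hypothetical stand-ins for the three separate fits:

# Pairing draw i from each fit treats the coefficient vectors as
# independent across models (the assumption noted above).
simulate_chain <- function(i, X) {
  B <- sim_B(X, bAB[i, ])             # A -> B stage (count)
  C <- sim_C(cbind(X, B), bBC[i, ])   # A + B -> C stage (multinomial)
  sim_D(cbind(X, B, C), bCD[i, ])     # -> D stage (binary)
}
# The manipulated data enter only here, for a given draw i:
# delta[i] <- mean(simulate_chain(i, X_low)) -
#             mean(simulate_chain(i, X_high))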

> Also, I assume you've seen this paper:
> http://www.stat.columbia.edu/~gelman/research/published/ape17.pdf
> A
I hadn't. Thank you for the reference.

I looked around in the 3rd edition of BDA but didn't find much; my
impression is the 2nd edition had a bit more, though I haven't checked
it.

Ross

P.S. Fortunately I've been able to speed up the R code, though the
post-analysis is still much slower than the model estimation, which
seems off. It took about 900 CPU hours to evaluate 2000 sets of
coefficients with 400 simulations for each. In contrast, I think
fitting all the models took only a handful of CPU hours.

Andrew Gelman

Dec 27, 2013, 4:39:52 PM
to stan-...@googlegroups.com

On Dec 27, 2013, at 9:22 PM, Ross Boylan wrote:

> On Fri, 2013-12-27 at 17:18 +0100, Andrew Gelman wrote:
>> Ross:
>> I think you can just draw one simulation of the outcome for each parameter value, so you can do that in the "predictive quantities" block or whatever it is called, right? But maybe there is something I'm missing here.
> If all I wanted was an estimate of the mean effect, I think doing one
> simulation per parameter set would suffice, but I don't think that gives
> me the uncertainty I need.

Ahh, I see what you are saying. In this case, I suspect that two random draws of new data per simulation of the parameters would suffice.
No, it's not in BDA; the place we discuss this is in my book with Jennifer Hill. But the treatment in the paper with Pardoe is more comprehensive.
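
Roughly (a sketch, not tested): with two independent draws d1 and d2 of the sample-mean delta per posterior draw of the parameters, the simulation noise can be estimated and subtracted.

# d1, d2: vectors of deltas, one pair per posterior draw.
sim_var   <- mean((d1 - d2)^2) / 2    # within-parameter noise
total_var <- var((d1 + d2) / 2)
param_var <- total_var - sim_var / 2  # variance from parameters alone
sqrt(param_var)                       # sd of delta, noise removed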

>
> Ross
>
> P.S. Fortunately I've been able to speed up the R code, though the
> post-analysis is still much slower than the model estimation, which
> seems off. It took about 900 CPU hours to evaluate 2000 sets of
> coefficients with 400 simulations for each.

With only 2 each it should be faster!

Ross Boylan

Jan 3, 2014, 7:16:37 PM
to stan-...@googlegroups.com
[This is really for Andrew as corresponding author of the Gelman and Pardoe 2007 paper http://www.stat.columbia.edu/~gelman/research/published/ape17.pdf--hereafter G&P--mentioned in previous messages, but I figured there might be some value to keeping it on list.]

G&P advocates doing average predictive comparisons for a focal variable u, with v representing all other variables, by computing a weighted average of predicted differences across the space of u(1) x u(2) pairs, in practice all possible increasing combinations of observed u values (e.g., equations 2 & 5).  It refers to these as "transitions" from u(1) to u(2) (p. 32) and weights them according to the likelihood of the transition (p. 37).

Following the causal effects literature, e.g., Pearl's ftp://ftp.cs.ucla.edu/pub/stat_ser/r379.pdf, I've been doing something different, and wonder if I'm missing something.  The approach there is to create u(1) and u(2) by direct manipulation, e.g., u(1) = observed u - 1/2 and u(2) = observed u + 1/2.
In that case, rather than evaluating an n x n grid of u(1) x u(2) values, one has only n pairs to evaluate, i.e., n x 2, where n is the number of observations.
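
(In code, the n x 2 version is just the following R sketch, with e_y()
a made-up stand-in for E(y|u, v):)

# u: length-n vector of observed focal values; V: other covariates.
# Since u(2) - u(1) = 1 for every pair, this is a per-unit effect.
delta_pearl <- mean(e_y(u + 1/2, V) - e_y(u - 1/2, V))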

The G&P approach seems odd for several reasons, with 3 being my main concern:
  1. The things referred to as "transitions" do not correspond to actual transitions in a real (or counterfactual) world in which u changed from u(1) to u(2).
  2. In complicated situations some u values might be incompatible with some v values, so that no outcome could be predicted for some u, v combinations.  As a practical matter one could work around this, but it suggests the oddity of considering all u values for each case.
  3. As the paper points out, effects are not generally linear in u(1) - u(2).  But it looks as if the weighting scheme is effectively averaging [E(y|u(1)) - E(y|u(2))] / [u(1) - u(2)] (see the sketch after this list).  The result apparently is to be interpreted as the predicted difference for a unit change in u, but it is actually coming from a linearized version of differences over varying scales.  This might be of some interest if u were subject to real transitions with the assumed distribution, but that interpretation seems inappropriate in most settings (repeating point 1).
  4. It's more work to compute over n x n than n x 2, especially when the n x n computation also involves coming up with some weighting function.
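
To make point 3 concrete, here is my reading of the G&P estimator as an
R sketch; e_y() and w_fun() are stand-ins for E(y|u, v) and their
weighting function, and I may be glossing details of the paper:

# Weighted average of differences over all increasing pairs (s, t),
# normalized by the weighted average of u-differences: a ratio of
# sums, i.e. a linearized slope.
n <- length(u)
num <- 0; den <- 0
for (s in seq_len(n)) for (t in seq_len(n)) {
  if (u[t] > u[s]) {
    w   <- w_fun(u[s], u[t], V[s, ])
    num <- num + w * (e_y(u[t], V[s, ]) - e_y(u[s], V[s, ]))
    den <- den + w * (u[t] - u[s])
  }
}
apc <- num / den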

In contrast, with the Pearl approach, taking the points in order:

  1. The changes from u(1) to u(2) correspond to hypothetical states of the world, sometimes to possible experimental manipulations.
  2. The counterfactual levels of u must be consistent with the observed v.
  3. The effects correspond to a fixed change in u.
  4. Less work.

What am I missing?

I realize that "predicted differences" is an attempt to avoid the claim that the differences are causal, but I don't see why that important interpretive issue bears on the different expectations being taken in G&P vs Pearl.

Thanks.

Ross Boylan

P.S. What I'm doing differs in other ways from the paper's setup, mostly in that I do not have ready access to E(y|u, v) (though I do have ready access to simulated values of y given u and v), but that's a different issue.

Andrew Gelman

Jan 8, 2014, 11:19:10 AM
to stan-...@googlegroups.com, Aki Vehtari
Hi, a few thoughts:

0.  The type of variable (discrete or continuous) of u is important.  You talk about u - 1/2 or u + 1/2, but if u is discrete, this will not be possible.  So in general we need to think about comparisons of u based on its distribution.
1.  We tried to be clear that these are predictive comparisons, not causal comparisons.  The analogy is to regression coefficients, not to causal estimates.  I think both of these are useful goals, and I agree they should not be confused.
2.  Yes, the issue of dependence of u and v is a concern.  I think this actually is handled in our method: note the terms p(u|v), so that we are including the u's conditional on the v.
3.  It would be possible to do things on nonlinear scales.  For my applications I wanted to understand the models in terms of changes in probabilities.  But it depends on one's applied goals.
4.  Unfortunately we have not yet programmed a general implementation of our method; indeed, if you look carefully, there are various open choices, for example in the level of smoothing that is used in the averaging.

Regarding the Pearl approach, I haven't looked at it but from your description it seems like it is answering a different question, which I think would make it more useful in some settings and less useful in others.

A

P.S.  I'm cc-ing Aki because he is interested in this topic.

Aki Vehtari

Jan 9, 2014, 7:26:22 AM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Hi all,

I think Andrew answered well, although he could have been more specific in 2: If u is not consistent with v (so that [u v] is outside the observed data), then that transition gets a small weight.

Aki