multivariate probit R simulation and Stan model

600 views
Skip to first unread message

Bob Carpenter

unread,
Apr 21, 2015, 9:02:36 PM4/21/15
to stan-...@googlegroups.com
I feel like a traveling salesman who at each stop is asked
to perform a new demonstration of Stan. For today's demo,
Michael McCarthy, a quant ecologist in Melbourne wanted a
multivariate probit for some occupancy models they're working
on. Ask and ye' shall receive.

I'm attaching an R simulation script and Stan model to
fit a multivariate probit:

http://en.wikipedia.org/wiki/Multivariate_probit_model

This is a model where each observation y[n] is a binary D-vector
of values and comes with a real-valued K-vector of predictors x[n].
The point is to model the correlation of the responses in the
binary values for each output.

The multivariate probit is not to be confused with multinomial
probit (where each observation y[n] is a single value in 1:K).

Getting the multivariate probit coded in Stan involves

* the latent variable parameterization where each y[n,d] gets
a parameter z[n,d]

* declaring z[n,d] to have a lower bound of 0 if y[n,d] = 1 and
an upper bound of 0 if y[n,d] = 0.

- this involved the transformed data to split the y
into positive and negative indexings

- declaring separate y vectors with appropriate constraints

- using transformed params to put them back togetehr

* using a correlation matrix rather than covariance matrix to identify
the scale (see the Wikipedia article for a 2D example)

- then putting it on the Cholesky factor parameterization to
allow both a prior and some degree of efficiency.

- reconstructing the correlation matrix in the generated quantities block

It would also be possible to deal with unknown values of y or
new predictions. For missing values of y, there'd be a third parallel
array of unconstrained vectors that'd go in the same way. For predictions,
you'd want to do multi_normal RNGs in the generated quantities block and
threshold for efficiency.

- Bob

P.S. Next step would be figuring out how to encapsulate some of that
work in the .stan file into functions that could be reused. It'll be
crazy making to do this on a model-by-model basis.

probit-multi-sim.R
probit-multi.stan

Andrew Gelman

unread,
Apr 21, 2015, 9:04:00 PM4/21/15
to stan-...@googlegroups.com
Hi, you can also feel free to blog these sorts of messages (including, in particular, this one) because I think they are of general interest.
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <probit-multi-sim.R><probit-multi.stan>
>
>
>
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.


Bob Carpenter

unread,
Apr 21, 2015, 9:12:35 PM4/21/15
to Andrew Gelman, stan-...@googlegroups.com
OK. Will do when I get some time. I'd want to dump out
the results and perhaps provide a bit more explanation to
a general audience --- I'd fear they'd think it looked too
ugly compared to BUGS. (Of course, BUGS can't really fit
these models because it's so bad at multivariates.)

This one actually needs more than the 200 iterations I gave
it. My first example went very quickly, but the next problem
I randomly generated was probably harder. Still close enough
for stats.

- Bob

Ben Goodrich

unread,
Apr 21, 2015, 10:56:02 PM4/21/15
to stan-...@googlegroups.com
On Tuesday, April 21, 2015 at 9:02:36 PM UTC-4, Bob Carpenter wrote:
  * using a correlation matrix rather than covariance matrix to identify
    the scale (see the Wikipedia article for a 2D example)

       - then putting it on the Cholesky factor parameterization to
         allow both a prior and some degree of efficiency.

The .stan program needs a

lkj_corr_cholesky(eta);

Uniform on L_Omega implies a weird set of prior beliefs about Omega.

Ben

Bob Carpenter

unread,
Apr 22, 2015, 5:26:35 AM4/22/15
to stan-...@googlegroups.com
Thanks --- I thought I had that in there.

Also, it looks like it needs to run for quite a few more iterations
than 200 in general. I must've gotten lucky on my first simulation and
fit!

I hope everyone keeps in mind that correlations have a lot of sample
variation at this level and hare hard to estimate accurately.
That is, Stan's not failing here because the posterior mean of the correlation
parameters doesn't match the value in the simulated correlation matrix --- look
at the interval coverage, which is better, but still not perfect.

- Bob
Reply all
Reply to author
Forward
0 new messages