I feel like a traveling salesman who at each stop is asked
to perform a new demonstration of Stan. For today's demo,
Michael McCarthy, a quant ecologist in Melbourne wanted a
multivariate probit for some occupancy models they're working
on. Ask and ye' shall receive.
I'm attaching an R simulation script and Stan model to
fit a multivariate probit:
http://en.wikipedia.org/wiki/Multivariate_probit_model
This is a model where each observation y[n] is a binary D-vector
of values and comes with a real-valued K-vector of predictors x[n].
The point is to model the correlation of the responses in the
binary values for each output.
The multivariate probit is not to be confused with multinomial
probit (where each observation y[n] is a single value in 1:K).
Getting the multivariate probit coded in Stan involves
* the latent variable parameterization where each y[n,d] gets
a parameter z[n,d]
* declaring z[n,d] to have a lower bound of 0 if y[n,d] = 1 and
an upper bound of 0 if y[n,d] = 0.
- this involved the transformed data to split the y
into positive and negative indexings
- declaring separate y vectors with appropriate constraints
- using transformed params to put them back togetehr
* using a correlation matrix rather than covariance matrix to identify
the scale (see the Wikipedia article for a 2D example)
- then putting it on the Cholesky factor parameterization to
allow both a prior and some degree of efficiency.
- reconstructing the correlation matrix in the generated quantities block
It would also be possible to deal with unknown values of y or
new predictions. For missing values of y, there'd be a third parallel
array of unconstrained vectors that'd go in the same way. For predictions,
you'd want to do multi_normal RNGs in the generated quantities block and
threshold for efficiency.
- Bob
P.S. Next step would be figuring out how to encapsulate some of that
work in the .stan file into functions that could be reused. It'll be
crazy making to do this on a model-by-model basis.