tips on improving mixing in a multivariate mixed response model


terrance savitsky

Nov 14, 2012, 7:19:46 PM11/14/12
to stan-...@googlegroups.com
Hi, I've posted a model on this board a few times along the way that inputs multiple responses, each as a multivariate matrix.   They are all linked through shared sets of multivariate random effects.   Ben worked with me to optimize the auto-diff processing and RAM management.   Let's focus on one of the sets of multivariate random effects, Delta[t,d].   Running Stan for 2500 iterations (of which 1000 are burn-in) demonstrates terribly poor mixing of Delta, as shown in the attached plots (each plot contains a random selection of Delta[t,d] parameters).   Essentially, both of the chains I instantiate barely move.   I run a similar model in JAGS, but under conjugate priors (e.g. Wishart priors on the precision matrices of the sets of multivariate random effects, such as Delta).   I wanted to employ a less informative construction, and so imposed uniform priors on the correlation matrices under Stan.   Anyway, the mixing in JAGS is very good.

Stan is forced to choose very small step sizes to make the chain go.   This performance runs counter to my intuition about NUTS's suppression of random-walk behavior.   The only idea that occurs to me is that 3 of the 6 response matrices are ordinal.   Some of the associated cutpoints (those that lack observations) mix poorly in JAGS, though that does not affect the robust mixing of the random effects, Delta, there.  Since NUTS moves the entire parameter space on each iteration, I wonder whether the step sizes for all the parameters may be constrained by a poorly mixing subset.   That would not seem to be the case, unless there is a high correlation among them that is not ameliorated by the Hamiltonian.

Thanks for any tips you would offer.

--
Thank you, Terrance Savitsky
ordMultiProt_E.stan
traceplots.pdf
autocorr.pdf

Ben Goodrich

Nov 15, 2012, 2:11:48 AM11/15/12
to stan-...@googlegroups.com
Can you first tell us about which quantities in the parameters {} block have especially good or bad mixing? For Delta, which is a transformed parameter, there is an additional layer of complication due to the transformation(s). But the stuff in the parameters {} block is more directly relevant to the question of whether too many proposals are being rejected or the step sizes are too small. I would guess that the problem is that some parameters are very correlated. If it is a relatively simple case of high pairwise correlation(s), then that can be checked in the draws from the posterior distribution that you have.
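For instance, something like this (a rough numpy sketch, with simulated draws standing in for your posterior sample; the array and the 0.9 threshold are made up for illustration):

```python
import numpy as np

# Hypothetical draws: an (iterations x parameters) matrix extracted
# from the posterior sample. Here we simulate two highly correlated
# parameters and one independent one.
rng = np.random.default_rng(0)
a = rng.normal(size=4000)
draws = np.column_stack([a,
                         a + 0.1 * rng.normal(size=4000),
                         rng.normal(size=4000)])

# Pairwise posterior correlations; large off-diagonal entries flag
# directions the sampler has trouble moving along.
corr = np.corrcoef(draws, rowvar=False)
high = np.argwhere(np.triu(np.abs(corr) > 0.9, k=1))
print(high)  # pairs of parameter indices with |corr| > 0.9
```

With real output, you would substitute the matrix of saved draws for the simulated columns above.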

Ben

terrance savitsky

Nov 15, 2012, 12:53:20 PM11/15/12
to stan-...@googlegroups.com
Hi, Well, for memory reasons I didn't return the parameters that are transformed to Delta, so I'll have to instantiate another run to grab those.  That said, I did return the (directly sampled) cutpoint parameter vectors, gamma1_u, gamma2_u, gamma3_u, and their trace plots look very similar to those for the transformed parameter matrix, Delta; that is, small step sizes and high autocorrelations.  The sampled values are exploring the posterior distributions very slowly.  Back to Delta: the construction that samples Delta_norm as uncorrelated and then uses the Cholesky factor of the covariance matrix to transform to Delta follows the suggested process for minimizing the correlations among the resulting Delta.

I'm achieving more information per sample in JAGS than in Stan.   I was hoping that the auxiliary (momentum) parameters in the Hamiltonian would suppress random-walk behavior and allow larger steps in the presence of pairwise correlations, so that I could get away with sampling only a few thousand iterations.   The JAGS approach may provide more informative samples, in this case, because it samples conditionally rather than jointly; I thought that wouldn't really matter under NUTS.   So if the answer is that the parameters express moderate-to-high pairwise correlations and that explains the small step sizes under NUTS, that would be disappointing.



Bob Carpenter

Nov 15, 2012, 1:10:40 PM11/15/12
to stan-...@googlegroups.com
HMC is rotation invariant, so correlations alone aren't
going to cause problems. But it's not scale invariant.

Gibbs, on the other hand, is scale invariant, but not
rotation invariant.

Even basic HMC crushes Gibbs on unit-variance, highly-correlated
multivariate normals (assuming you don't cheat and have Gibbs
do a multivariate exact update!).

The varying step-size adaptation we do in Stan helps somewhat
with global scale issues, but can't deal with the local scale
issues exemplified by Neal's funnel distribution or the
non-orthogonal scale exemplified by a multivariate normal with
high correlations and widely varying variances.
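The funnel's local-scale problem is easy to see numerically (a throwaway sketch, not Stan code; the v values are arbitrary):

```python
import numpy as np

# Neal's funnel (in the form used in the Stan manual):
#   v ~ normal(0, 3) and x | v ~ normal(0, exp(v/2)).
# The conditional scale of x varies by orders of magnitude across
# plausible values of v, so no single step size fits everywhere.
v = np.array([-6.0, 0.0, 6.0])   # low, typical, and high values of v
sd_x = np.exp(v / 2)             # conditional sd of x given v
print(sd_x)                      # roughly [0.05, 1.0, 20.09]
```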

It's possible in some cases to transform the geometry of the
sampler and improve sampling. The idea's to normalize the varying
scale introduced by parameters such as hierarchical deviation parameters
which have strong effects on lower-level parameters when the deviations
are tiny and highly constraining.

Ideally, you'd have both scale and rotation invariance, but doing that
requires something like Riemann Manifold HMC, which is very
computationally demanding (N^3 in the number of dimensions per
leapfrog step).

- Bob

Marcus Brubaker

Nov 15, 2012, 1:39:14 PM11/15/12
to stan-...@googlegroups.com
Actually, HMC is only rotation invariant with a scaled identity mass
matrix. NUTS II breaks this, making its performance rotation
dependent because of the more general diagonal mass matrix. However,
in general, NUTS II should always be better than NUTS I.

To illustrate, consider two independent variables, one with a
small scale and one with a large scale. With NUTS I, the mass is
set based on the smaller of the two scales and is unaffected by a
rotation of the parameter space. NUTS II sets two different
scales for the independent variables, but when they're rotated it
is forced to set the scales based on the diagonal of the rotated
covariance, which will be somewhere between the smaller and
larger scale and produce worse performance than in the unrotated
case.
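That two-scale example can be checked numerically (a sketch, not Stan's actual adaptation code; the covariance and 45-degree angle are made up):

```python
import numpy as np

# Two independent variables with very different scales (variances).
cov = np.diag([0.01, 100.0])

# Rotate the parameter space by 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = R @ cov @ R.T

# A diagonal mass matrix (NUTS II style) can only adapt to the
# diagonal of the rotated covariance, which averages the two scales.
print(np.diag(rotated))  # both entries 50.005, between 0.01 and 100
```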

Cheers,
Marcus
> --
>
>

Ben Goodrich

Nov 15, 2012, 1:48:15 PM11/15/12
to stan-...@googlegroups.com
On Thursday, November 15, 2012 12:53:22 PM UTC-5, tds151 wrote:
Hi, Well, for memory reasons i didn't return the parameters which are transformed to Delta, so I'll have to instantiate another run to grab those.  That said, I did return the (directly sampled) cutpoint parameter vectors, gamma1_u, gamma2_u, gamma3_u, and their trace plots look very similar to those for the transformed parameter matrix, Delta; that is, small step sizes and high autocorrelations.  The posterior values are very slowly exploring the posterior distributions.  Back to Delta, so the construction to sample Delta_norm as uncorrelated and then use the cholesky factor of the covariance matrix to transform to Delta follows the suggested process in order to minimize the correlations among the resulting Delta.

Bob and Marcus explained things in more detail than I did. But what I was implying was that Delta is less likely to be the root cause of the problem, because you are composing it from a Cholesky factor of the covariance matrix and independent standard normals. So the parameter space along those directions is likely to be less correlated and to have more similar scales. Some other parameter, maybe the cutpoints, could be adversely affecting the whole model, particularly if there is both high correlation and widely varying scales.
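That composition can be sketched in numpy with a made-up 2x2 covariance (this only illustrates the mechanism, not your actual Stan script):

```python
import numpy as np

# Hypothetical target covariance for one row of Delta.
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])
L = np.linalg.cholesky(Sigma)  # lower triangular, Sigma = L @ L.T

rng = np.random.default_rng(1)
Delta_norm = rng.normal(size=(200_000, 2))  # independent standard normals
Delta = Delta_norm @ L.T                    # rows now have covariance Sigma

print(np.cov(Delta, rowvar=False))  # close to Sigma
```

The sampler works in the uncorrelated, unit-scale Delta_norm coordinates; the correlation and scale are reintroduced only by the deterministic transform.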

Ben

Bob Carpenter

Nov 15, 2012, 1:59:47 PM11/15/12
to stan-...@googlegroups.com


On 11/15/12 1:39 PM, Marcus Brubaker wrote:
> Actually, HMC is only rotation invariant with a scaled identity mass
> matrix.

Sorry -- should've made that clearer. For Stan, that
means using the

--equal_step_sizes

option in the command line.

Otherwise, you get the NUTS II behavior of varying step
sizes (equivalently, diagonal mass matrix), which as Marcus
points out, is not rotation invariant.

> NUTS II breaks this, making its performance rotation
> dependent because of the more general diagonal mass matrix. However,
> in general, NUTS II should always be better than NUTS I.

Assuming it can adapt to the right per-variable scales during
warmup, of course. So far, it's worked very well for us, but
there aren't any guarantees.

- Bob

terrance savitsky

Nov 15, 2012, 9:14:41 PM11/15/12
to stan-...@googlegroups.com
It seems I would want the NUTS II version, which is probably very important for a black-box implementation across all model parameters.  Anyway, thanks for the helpful explanations, as they add a good perspective on using Stan effectively.   To Ben's comment, I think it is worth looking at the scale of the cutpoints.   Given that they are ordered, is there an obvious way to separate the cutpoint parameter into a product of a de-scaled parameter and a scale parameter?






Ben Goodrich

Nov 15, 2012, 9:42:42 PM11/15/12
to stan-...@googlegroups.com
On Thursday, November 15, 2012 9:14:43 PM UTC-5, tds151 wrote:
It seems I would want the NUTS II version, which is probably very important for a black-box implementation across all model parameters.  Anyway, thanks for the helpful explanations, as they add a good perspective on using Stan effectively.   To Ben's comment, I think it is worth looking at the scale of the cutpoints.   Given that they are ordered, is there an obvious way to separate the cutpoint parameter into a product of a de-scaled parameter and a scale parameter?

If your model allows the cutpoints to be non-negative, then you could try operationalizing the cutpoint vector as the cumulative sum of a rescaled simplex vector.
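For instance (a numpy sketch with made-up numbers; the simplex values and the scale are hypothetical):

```python
import numpy as np

# Hypothetical simplex of K = 4 increments (non-negative, sum to 1).
theta = np.array([0.1, 0.2, 0.3, 0.4])
scale = 5.0  # free positive scale parameter

# The cumulative sum is increasing and ends at exactly 1; dropping
# that last entry and rescaling yields K-1 = 3 ordered, non-negative
# cutpoints whose overall scale is carried by a single parameter.
cutpoints = scale * np.cumsum(theta)[:-1]
print(cutpoints)  # [0.5 1.5 3. ]
```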

Ben

terrance savitsky

Nov 16, 2012, 3:17:10 PM11/16/12
to stan-...@googlegroups.com
Thank you.  Just to clarify: there is not a cumsum function in Stan that returns a vector, so I will write it out in the Stan script (in the absence of partial indexing).



Bob Carpenter

Nov 16, 2012, 8:03:56 PM11/16/12
to stan-...@googlegroups.com
And if not, you can rescale AND relocate a simplex.
That's sort of what the ordered types look like
inside anyway -- just a normalized stick-breaking-like
process.

- Bob

Ben Goodrich

Nov 16, 2012, 8:23:14 PM11/16/12
to stan-...@googlegroups.com
On Friday, November 16, 2012 3:17:11 PM UTC-5, tds151 wrote:
Thank you.  Just to clarify: there is not a cumsum function in Stan that returns a vector, so I will write it out in the Stan script (in the absence of partial indexing).

Actually, instead of rescaling and / or relocating a simplex, it might work better to consider the cumulative sum of a simplex to be chunks of the logistic CDF. If you feed those into the inverse CDF, which is known in the case of the logistic distribution, then you get cutpoints out.
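Sketched numerically (the simplex here is made up; note that the last cumulative sum is exactly 1 and must be dropped before applying the inverse CDF, since the logit of 1 is infinite):

```python
import numpy as np

def logit(p):
    # Inverse of the logistic CDF (its quantile function).
    return np.log(p / (1.0 - p))

# Hypothetical simplex over K = 4 response categories.
theta = np.array([0.1, 0.2, 0.3, 0.4])

# The first K-1 cumulative sums lie strictly in (0, 1), so they can
# be read as logistic CDF values; the logit maps them to K-1 ordered
# real cutpoints. The final cumulative sum equals 1 and is excluded.
cutpoints = logit(np.cumsum(theta)[:-1])
print(cutpoints)  # increasing: roughly [-2.197, -0.847, 0.405]
```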

Ben

terrance savitsky

Nov 20, 2012, 12:33:55 PM11/20/12
to stan-...@googlegroups.com
Please recall that I experienced a poorly mixing set of chains, possibly due to the lack of scale invariance of HMC.   The Stan script for the poorly mixing model already sampled a set of standard normals and transformed them to dependent multivariate normal parameters, so all that was left with scale was the set of cutpoints associated with the ordinal observations.   The attached script follows your helpful suggestion to scale and locate the cumsum of a simplex vector for constructing the cutpoints.   (As an aside, one cannot treat the cumsum of a simplex as a CDF value and then inverse transform it, because the simplex sums to 1 and the final cumulative value would map to infinity.)   I now receive the following sampling error (on multiple instantiated runs):

Iteration:  500 / 2500 [ 20%]  (Adapting)
Error in sampler$call_sampler(args_list[[i]]) : std::bad_alloc
error occurred during calling the sampler; sampling not done

I've also implemented some minor variants of the above approach, including constraining the cutpoints to be positive so that I can remove the location parameter.  I receive the exact same error message.   All to say, it appears that Stan doesn't like the approach of constructing cutpoints from a simplex.   I'd appreciate any suggestions or insights you may offer.

Thanks, Terrance.


ordMultiProt_E_ucut2.stan

Bob Carpenter

Nov 20, 2012, 2:57:00 PM11/20/12
to stan-...@googlegroups.com

On 11/20/12 12:33 PM, terrance savitsky wrote:
> Please recall that I experienced a poorly mixing set of chains, possibly due to the lack of scale invariance of HMC.
> the stan script for the poorly mixing model already sampled a set of standard normals and transformed them to dependent
> multivariate normal parameters. so all that was left with scale were a set of cutpoints associated to ordinal
> observations. the attached script follows your helpful suggestion to scale and locate the cumsum of a simplex vector
> for constructing the cutpoints. (As an aside, one cannot treat the cumsum of a simplex as a cdf value and then inverse
> transform it because the simplex sums to 1). i now receive the following sampling error (on multiple instantiated runs):
>
> Iteration: 500 / 2500 [ 20%] (Adapting)
> Error in sampler$call_sampler(args_list[[i]]) : std::bad_alloc
> error occurred during calling the sampler; sampling not done

This just means you've run out of memory.
The simplex parameterization will be a bit more
memory intensive, so if there are a lot of them and
you were on the edge of memory limits before, this
could put you over.

> I've also implemented some minor flavors of the above approach, including constraining the cutpoints to be positive so
> that i remove the location parameter. I receive the exact same error message. All to say, it appears that stan
> doesn't like the approach of constructing cutpoints from a simplex.

This should be OK in Stan. I'm thinking it's probably
just the memory. If you want to send us the model code,
I could check that we're talking about the same thing.
What I was thinking from Ben's suggestion was something
like this:

parameters {
simplex[K] theta;
...
transformed parameters {
vector[K-1] cutpoints;
real sum;
sum <- 0.0;
for (k in 1:(K-1)) {
sum <- sum + theta[k];
cutpoints[k] <- inv_logit[sum];
}
...

The implicit prior here is uniform on theta,
with whatever the implications are of that for
the cutpoints.

The alternative is to just use

parameters {
ordered[K-1] cutpoints;
}

This model isn't proper as it stands; you need priors on
the cutpoints because they're unconstrained.

There's also a pos_ordered data type that makes the
first cutpoint non-negative (hard to avoid 0 with
underflow issues).

- Bob

terrance savitsky

Nov 20, 2012, 5:52:49 PM11/20/12
to stan-...@googlegroups.com
Hi Bob, I've had other RAM issues with Stan runs, so your explanation seems correct.   I had moved my Stan runs to a 64GB RAM machine, but I am not the only user.   I've re-started a run with a Stan script version that maximally employs local variables in the model block to reduce the memory load of the auto diff.   Thanks, too, for clarifying Ben's suggestion and for the more efficient cumsum script.  Silly me.  Your writing it out explicitly also reminded me that I'm dealing with a simplex, which sums to 1, so I need to define a simplex of length K to extract K-1 random entries.  I had been using only a K-1 simplex, so that the last cutpoint (after performing the cumsum) was just the value of the multiplicative scale parameter used to uplift the cumsum(simplex).  This is not what I wanted to do.   Thanks again.






Ben Goodrich

Nov 20, 2012, 7:39:39 PM11/20/12
to stan-...@googlegroups.com
On Tuesday, November 20, 2012 2:57:04 PM UTC-5, Bob Carpenter wrote:
 
What I was thinking from Ben's suggestion was something
like this:

parameters {
   simplex[K] theta;
   ...
transformed parameters {
   vector[K-1] cutpoints;
   real sum;
   sum <- 0.0;
   for (k in 1:(K-1)) {
     sum <- sum + theta[k];
     cutpoints[k] <- inv_logit[sum];
   }
   ...

The last line should be

cutpoints[k] <- logit(sum);

The implicit prior here is uniform on theta,
with whatever the implications are of that for
the cutpoints.

In the manual for 1.0.4 I put a subsection on composing a simplex from standard normals such that the simplex is distributed Dirichlet. That might be the most geometrically friendly alternative, although it won't help with the RAM shortage.
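The mechanism can be sketched in numpy (my numbers are made up, and the manual's version may differ in detail): sums of squares of m standard normals are chi-square(m), i.e. Gamma(m/2, scale 2), and normalizing independent Gammas with a common scale yields a Dirichlet(m/2, ..., m/2) simplex.

```python
import numpy as np

rng = np.random.default_rng(2)
K, m = 3, 4  # K simplex components, m standard normals per component

# Each component is a sum of m squared standard normals:
# chi-square(m), which is Gamma(shape = m/2, scale = 2).
z = rng.normal(size=(100_000, K, m))
g = np.sum(z ** 2, axis=2)

# Normalizing independent Gammas with a common scale gives a
# Dirichlet simplex with concentration m/2 for every component.
theta = g / g.sum(axis=1, keepdims=True)

print(theta.mean(axis=0))  # near [1/3, 1/3, 1/3], the Dirichlet(2,2,2) mean
```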

Ben

terrance savitsky

Dec 26, 2012, 8:23:15 PM12/26/12
to stan-...@googlegroups.com
Hi, I've continued to work with a prototype model with multiple response likelihoods specified as either continuous or ordered.  Each response is multivariate and composed from nested sets of random effects shared across the response likelihoods.   I had experienced poor mixing over positive_ordered cutpoints (for 3 of the 6 response likelihoods), so I followed your helpful coaching to generate the cutpoints from scaled cumulative_sum()s of simplex vectors to remove the scale under sampling.   This approach compiles and runs under rstan 1.1.0 under R 2.15.1 and Rcpp 10.2.

I decided to try Ben's approach of generating the simplex vectors as Dirichlet distributed from standard normals, as a way to further improve the sampling geometry under NUTS.   My implementation crashes R.  All else in this Stan script is the same as in the model that runs, save for the construction of the simplex vectors from standard normals.   So I'd appreciate it if you would have a quick look at my implementation of Ben's method for something I might have done improperly.

The focus variables are:  z1, z2, z3 (the standard normal variates) in the parameters {} block and gamma1_u, gamma2_u, gamma3_u (the resulting simplexes) in the transformed parameters{} block.   The further scaling of, say, gamma1_u to gamma1 is identical to that in the script that compiles and runs.

Lastly, a minor error on page 74 of the documentation: "z ~ normal(0,1)" appears not to be correct as z is not a vector - maybe put it in a loop to sample z[k], k = 1, ... K.

As always, thanks for your help and expertise.  Terrance.

> In the manual for 1.0.4 I put a subsection on composing a simplex from standard normals such that the distribution of the simplex is distributed Dirichlet. That might be the most geometrically friendly alternative, although it won't help with the RAM shortage.


ordMultiProt_E_norm.stan

Bob Carpenter

Dec 27, 2012, 4:11:51 AM12/27/12
to stan-...@googlegroups.com


On 12/26/12 8:23 PM, terrance savitsky wrote:

> Lastly, a minor error on page 74 of the documentation: "z ~ normal(0,1)" appears not to be correct as z is not a vector
> - maybe put it in a loop to sample z[k], k = 1, ... K.

Thanks. That's definitely a mistake and I just fixed the
manual for the next release. This is the problem with
having these model sketches that don't get compiled!

- Bob

terrance savitsky

Dec 28, 2012, 11:32:10 AM12/28/12
to stan-...@googlegroups.com
Hi again,  I'd appreciate any comment on my implementation of Ben's sketch for generating dirichlet-distributed simplex vectors from sets of standard normals.   My implementation is crashing R, but I'm unable to discover what is incorrect in my specification.

A related question:  Would it produce faster Stan runs for me to replace an ordered_logistic implementation with a direct lp_ augmentation for a multinomial likelihood of category probabilities?

Thank you, Terrance Savitsky

Ben Goodrich

Dec 28, 2012, 1:54:19 PM12/28/12
to stan-...@googlegroups.com
On Friday, December 28, 2012 11:32:10 AM UTC-5, tds151 wrote:
Hi again,  I'd appreciate any comment on my implementation of Ben's sketch for generating dirichlet-distributed simplex vectors from sets of standard normals.   My implementation is crashing R, but I'm unable to discover what is incorrect in my specification.

We need whatever output is produced when the model is run after executing set_cppo("debug") in R and ideally the data that produces the problem. I tried with one of the data files you sent previously, but that didn't have all of the right variables.
 
A related question:  Would it produce faster Stan runs for me to replace an ordered_logistic implementation with a direct lp_ augmentation for a multinomial likelihood of category probabilities?

One could, in principle, do something for ordered_logistic that is similar to what is done by bernoulli_logit. But that is a third-order problem.

Ben
 

terrance savitsky

Dec 28, 2012, 9:39:12 PM12/28/12
to stan-...@googlegroups.com



1. Sorry that I didn't employ set_cppo("debug") earlier.   I assumed I was successfully compiling and linking, but encountering a runtime problem.  Anyway, I receive an odd error message about failure to compile due to insufficient memory.  There is a warning statement regarding the free_memory() function of agrad.  If I run the script under fast compile O3 mode, R crashes.  The error response is odd because it occurs during the compile phase and I'm running the script on a machine with 80GB of unused RAM.  During the compilation step, a maximum of 1.2GB is used.   That said, the processor on the machine is nearly maxed out.

Error in compileCode(f, code, language = language, verbose = verbose) : 
  Compilation ERROR, function(s)/method(s) not created! cygwin warning:
  MS-DOS style path detected: C:/PROGRA~1/R/R-215~1.1/etc/x64/Makeconf
  Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-215~1.1/etc/x64/Makeconf
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
D:/Users/savitsky/Documents/R/win-library/2.15/rstan/include//stansrc/stan/agrad/agrad.hpp:2171:17: warning: 'void stan::agrad::free_memory()' defined but not used [-Wunused-function]
c:/rtools/gcc-4.6.3/bin/../lib/gcc/i686-w64-mingw32/4.6.3/../../../../i686-w64-mingw32/bin/as.exe: file2d646a386305.o: too many sections (39205)
D:\Users\savitsky\AppData\Local\Temp\20\ccZoO8Jj.s: Assembler messages:
D:\Users\savitsky\AppData\Local\Temp\20\ccZoO8Jj.s: Fatal error: can't write file2d646a386305.o: File too big
c:/rtools/gcc-4.6.3/bin/../lib/gcc/i686-w64-mingw


2. I attach an .RDump file with the data and the associated stan script

3. I have another script running where the cutpoints are generated directly from scaled simplex vectors.   It is currently consuming about 500MB of RAM and, oddly, not increasing (though I am running it from R, so I assume sampled values are written to disk), leading me to believe that the per-iteration speed is slow.   (It's been running for something like 50 hours and hasn't completed 2500 iterations.)   All to say, I wondered if the ordered_logistic() likelihood increments might be an issue.   That prompted my question about an alternative that directly increments lp_ under a multinomial likelihood from the category probabilities, though there is nothing inherently more efficient in the specification, per se.

terrance
datstan_mp.Rdump
ordMultiProt_E_norm.stan

Jiqiang Guo

Dec 28, 2012, 9:51:08 PM12/28/12
to stan-...@googlegroups.com
Just to point out that the gcc in Rtools is a 32-bit program, which cannot take advantage of all the RAM in a 64-bit OS.

--
Jiqiang 

Daniel Lee

Dec 28, 2012, 9:54:38 PM12/28/12
to stan-...@googlegroups.com
Have you tried using the command line Stan? I was successfully able to compile the attached model from the command line using the g++ from Rtools under Windows 7.



Ben Goodrich

Dec 29, 2012, 1:59:53 AM12/29/12
to stan-...@googlegroups.com
It is conceivable that the model compiles at O3 but can't compile with a 32-bit g++ at O0. In any event, a runtime error appears with 64-bit everything and O0:

goodrich@CYBERPOWERPC:/tmp/SU/tds$ ./ordMultiProt_E_norm --data=datstan_mp.Rdump

Exception: INDEX OPERATOR [] OUT OF BOUNDS; index=3; lower bound=1; upper bound=2; index position=1; z1

Diagnostic information: 
Dynamic exception type: std::out_of_range
std::exception::what: INDEX OPERATOR [] OUT OF BOUNDS; index=3; lower bound=1; upper bound=2; index position=1; z1

goodrich@CYBERPOWERPC:/tmp/SU/tds$ grep -n z1 ordMultiProt_E_norm.stan 
58:     matrix[D1,ncut1] z1[beta];
81:     corr_matrix[D1] R_z1; 
88:     vector<lower=0>[D1] s_z1;
131:    Zeta1           <- Zeta1_norm * transpose(cholesky_decompose(R_z1)) * diag_matrix(s_z1);
146:      /* generate simplex, gamma1[d,1:(ncut1-1)] from standard normals, z1[d,1:(ncut2-1),1:beta] */
147:      for( c in 1:ncut1 ) gamma1_u[d,c] <- dot_self(z1[d,c]); /* each z1[d,c] is of length beta */
229:    for( d in 1:D1) for( c in 1:ncut1 ) z1[d,c]     ~ normal(0,1);


And, FYI, although it is unrelated to the problems you are having, as of v1.1.0 you can put pretty much anything to the left of a ~ if only scalars are passed to the distribution. So, for example

for( d in 1:Ds ) col(Phi_norm,d) ~ normal(0,1);

can just become the somewhat faster and simpler

Phi_norm ~ normal(0,1);

and so forth.

Ben

terrance savitsky

Dec 29, 2012, 9:20:39 PM12/29/12
to stan-...@googlegroups.com
Thanks very much for the diagnostics.  As Daniel suggests, I need to switch to running Stan from the command line to capture runtime errors (and to eliminate the possibility of crashing R).   Anyway, now everything is working.  Some comments related to your comments:

1. Putting anything to the left of a ~ where scalars are passed to the distribution works when the left-hand side is an array of scalar elements.  It will not compile, for example, for "z" under the following declaration:  vector[beta] z[K];  - presumably because each element of z is a vector.  Anyway, your tip (as always) is very helpful.  Will the user find it in the documentation?

2. On a related point, there is an additional minor error on page 73, in Ben's enumeration of the approach to produce a simplex under a Dirichlet distribution from a collection of standard normal variates.   The declaration of z should be changed to:  vector[beta] z[K];
3. I will report back the run properties of this model.  It is a magilla/ginormous multivariate model and, if it samples efficiently, that's really exciting for  the possibilities of using stan under multiple response types.

terrance.

Bob Carpenter

Dec 30, 2012, 12:16:53 AM12/30/12
to stan-...@googlegroups.com


On 12/29/12 9:20 PM, terrance savitsky wrote:
> Thanks very much for the diagnositics. As Daniel suggests, I need to switch to running stan from the command line to
> capture runtime errors (to eliminate the possibility of crashing R). Anyway, now everything is working. Some comments
> related to your comments:
>
> 1. putting anything to the left of a ~ where scalars are passed to the distribution works when a is an array of scalar
> elements. It will not compile, for example, for "z" under the following declaration: vector[beta] z[K]; - presumably
> because each element of z is a vector. Anyway, your tip (as always) is very helpful. Will the user find it in the
> documentation?

Yes, the vectorization is defined in the doc in section 25.1,
"vectorization". The problem with a manual so long is where to put
everything. I don't want to be too repetitive or it'll be even longer.

What we could really use is an index.

What the doc says is that where a signature says "reals",
that's a stand-in for any of:

real         scalar
real[]       1D array
vector       column vector
row_vector   row vector

We might extend this further, but we haven't settled on
all the boundary conditions (like when one input is
a 1D array and another a 2D array).

> 2. On a related point, there is an additional minor error on page 73 for Ben's enumeration of the approach to produce a
> simplex under a dirichlet distribution from a collection of standard normal variates. The declaration of z should be
> changed to: vector[beta] z[K];

Thanks. We fixed another of these issues and I just
fixed that one. It would lead you to believe our
vectorization was more general than the answer to (1)
above, but it was an error in the example model, which
also needs a loop.

> 3. I will report back the run properties of this model. It is a magilla/ginormous multivariate model and, if it samples
> efficiently, that's really exciting for the possibilities of using stan under multiple response types.

That'll be interesting to see. It'll also be interesting
to see why it fails to sample, if it does. We're working
on two things that may help in general, both related to
modeling more covariance in the models -- first, a non-diagonal
mass matrix adaptation, which should be ready for the next
release, and longer term, Riemann-manifold HMC. That's closer
than I would've thought now that we have Michael Betancourt
on board for a couple of months.

- Bob

terrance savitsky

Jan 10, 2013, 5:15:54 PM1/10/13
to stan-...@googlegroups.com
I write to report back on the mixing properties of a multivariate, mixed-outcomes model, which I attach.   The model is applied to classroom observations of teacher performances.   The performances are scored on multiple dimensions of an instrument that measures the efficacy of lesson delivery, producing multivariate observations.   The observations are attached to a nested object structure of teacher, classroom (section), lesson, and segments (or portions) of a lesson, and then multiple ratings.   The observations are scored by multiple raters assigned to lessons.   I attach a paper, to appear in Psychometrika, that describes the nature of the model.   The punchline is that I have multiple response types which share nested multivariate random effects.   I desire to make inference on the dependence structure of those random effects; most importantly, among teachers.

The attached diagnostic plots focus on samples from two sets of parameters:  the multivariate teacher random effects (Delta) and the variables associated with cutpoints for ordered outcomes (z).   The mixing is relatively poor with 2500 iterations and 1000 burn-in.   The model parameters are sampled in a de-scaled fashion, so there's nothing else I am able to see that I might do next to produce an improvement.   Mixing is much better in JAGS, though with priors on covariances induced under factor-analytic constructions or with the employment of inverse Wishart priors.   Runtimes are slow, and sampling 2500 iterations is about 25% slower than the required JAGS runtime to produce a converged sample of sufficient length.   The attraction to Stan in this application is the ability to employ more options for covariance priors, since conjugacy is not required for sampling covariance matrices, and Stan also more readily handles mixed-mode data.

I had hoped for a better result because past runs produced lots of messages that proposals were about to be rejected and this one produced none.  You will see that the chains occasionally even get stuck for a bit, before moving.

I'm hoping that "logging" this application may prove useful for understanding the performance of future iterations of Stan.  Thanks much for your coaching and patience in helping me compose, compile, and run the script.

terrance 


pmet695r2_manu.pdf
ordMultiProt_E_norm.stan
density_MP_cutp_stan.pdf
trace_MP_cutp_stan.pdf
autocorr_MP_stan.pdf
trace_MP_stan.pdf

Bob Carpenter

unread,
Jan 10, 2013, 5:27:50 PM1/10/13
to stan-...@googlegroups.com
Thanks. It's very helpful to see these huge models
and get reports on performance (especially vs. JAGS).
And to get a user perspective.

We really need to do a better job on that warning
message. The rejections are just for single states
on the Hamiltonian path, not of the whole proposal
(which is slice sampled along the path). Just think
of this as producing an example way below the slice.
Now if it happens on the first step of the leapfrog algorithm,
you will get a rejection.

Over the next few months, we'll speed up the multivariate
models by at least an order of magnitude. Right now,
we're doing all the derivatives by brute-force auto-dif,
but we can be more clever about this and do all the computations
and gradients using double values and then propagate.
We'll also vectorize them, but I doubt that's your issue.

But even so, we're not going to be able to speed up things like
your rescaling in the transformed parameters block, which
explicitly calls a cholesky decomposition. There may be a better
way to factor that in terms of algebraic operations that have
better auto-dif behavior.

Nothing else jumps out at me that's going to be a huge gain,
but I'm guessing all those Cholesky factorizations are a killer.

- Bob
> goodrich@CYBERPOWERPC:/tmp/SU/__tds$ ./ordMultiProt_E_norm --data=datstan_mp.Rdump
>
> Exception: INDEX OPERATOR [] OUT OF BOUNDS; index=3; lower bound=1; upper bound=2; index position=1; z1
>
> Diagnostic information:
> Dynamic exception type: std::out_of_range
> std::exception::what: INDEX OPERATOR [] OUT OF BOUNDS; index=3; lower bound=1; upper bound=2; index
> position=1; z1
>
> goodrich@CYBERPOWERPC:/tmp/SU/__tds$ grep -n z1 ordMultiProt_E_norm.stan
> 58: matrix[D1,ncut1] z1[beta];
> 81: corr_matrix[D1] R_z1;
> 88: vector<lower=0>[D1] s_z1;
> 131: Zeta1 <- Zeta1_norm * transpose(cholesky_decompose(R_z1)) * diag_matrix(s_z1);
> 146: /* generate simplex, gamma1[d,1:(ncut1-1)] from standard normals, z1[d,1:(ncut2-1),1:beta] */
> 147: for( c in 1:ncut1 ) gamma1_u[d,c] <- dot_self(z1[d,c]); /* each z1[d,c] is of length beta */
> 229: for( d in 1:D1) for( c in 1:ncut1 ) z1[d,c] ~ normal(0,1);
>
> |
>
> And, FYI, although it is unrelated to the problems you are having, as from v1.1.0, you can put pretty much
> anything
> to the left of a ~ if only scalars are passed to the distribution. So, for example
>
> |
> for( d in 1:Ds ) col(Phi_norm,d) ~ normal(0,1);
> |
>
> can just become the somewhat faster and simpler
>
> |
> Phi_norm ~ normal(0,1);
> |
>
> and so forth.
>
> Ben
>
>
> On Friday, December 28, 2012 9:39:12 PM UTC-5, tds151 wrote:
>
>
>
>
> 1. Sorry that I didn't earlier employ set_cppo("debug"). I assumed I was successfully compiling and
> linking,
> but encountering a runtime problem. Anyway, I receive an odd error message about failure to compile due to
> insufficient memory. There is a warning statement regarding the free_memory() function of agrad. If I
> run the
> script under fast compile O3 mode, R crashes. The error response is odd b/c it occurs during the
> compile phase
> and I'm running the script on a machine with 80GB of unused RAM. During the compilation step, a maximum of
> 1.2GB is used. That said, the processor on the machine is nearly maxed out.
>
> Error in compileCode(f, code, language = language, verbose = verbose) :
>   Compilation ERROR, function(s)/method(s) not created! cygwin warning:
>   MS-DOS style path detected: C:/PROGRA~1/R/R-215~1.1/etc/x64/Makeconf
>   Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-215~1.1/etc/x64/Makeconf
>
>   CYGWIN environment variable option "nodosfilewarning" turns off this warning.
>   Consult the user's guide for more details about POSIX paths:
>   http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
>
> D:/Users/savitsky/Documents/R/win-library/2.15/rstan/include//stansrc/stan/agrad/agrad.hpp:2171:17:
> warning: 'void stan::agrad::free_memory()' defined but not used [-Wunused-function]
>
> c:/rtools/gcc-4.6.3/bin/../lib/gcc/i686-w64-mingw32/4.6.3/../../../../i686-w64-mingw32/bin/as.exe:
> file2d646a386305.o: too many sections (39205)
> D:\Users\savitsky\AppData\Local\Temp\20\ccZoO8Jj.s: Assembler messages:
> D:\Users\savitsky\AppData\Local\Temp\20\ccZoO8Jj.s: Fatal error: can't write file2d646a386305.o: File too big
> c:/rtools/gcc-4.6.3/bin/../lib/gcc/i686-w64-mingw
> On Wed, Dec 26, 2012 at 5:23 PM, terrance savitsky <tds...@gmail.com>
> --
>
>

terrance savitsky

unread,
Jan 10, 2013, 9:40:09 PM1/10/13
to stan-...@googlegroups.com
thanks, Bob.  Too bad I couldn't achieve reasonable mixing - I know you're looking at more general formulations for the mass matrix to reduce the scale sensitivity.   I'm curious: what do you mean by "propagate"?  As a non-developer, my mental picture of the auto-diff is that it conducts repeated numerical differentiations on each of the multiple no-U-turn steps that compose a proposal.  If it were possible to use symbolic gradient computations, then one could compute the structure once, and the rest reduces to filling a matrix.   Within a numerical approach, I understand you could vectorize the gradient computations for speed; otherwise, isn't it about avoiding duplicate computations?
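[For readers following this exchange: automatic differentiation is neither numerical (finite-difference) nor symbolic differentiation - it propagates exact derivatives through the program by the chain rule. A minimal forward-mode sketch in Python, purely my own illustration (Stan actually uses reverse-mode autodiff, but the chain-rule propagation idea is the same):]

```python
# Minimal forward-mode automatic differentiation via dual numbers.
# Illustration only: Stan uses reverse-mode autodiff, but the key point
# holds for both modes -- derivatives come from exact chain-rule
# propagation through the program, not from finite-difference steps.

import math

class Dual:
    """A number carrying its value and its exact derivative."""
    def __init__(self, val, der=0.0):
        self.val = val   # function value
        self.der = der   # derivative with respect to the input variable

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

def exp(x):
    # chain rule: d/dx exp(x) = exp(x) * x'
    return Dual(math.exp(x.val), math.exp(x.val) * x.der)

# d/dx [x * exp(x)] at x = 1 is exp(1) * (1 + 1) = 2e
x = Dual(1.0, 1.0)    # seed the input's derivative with 1
y = x * exp(x)
print(y.der)          # exact gradient, no step-size error
```

Reverse mode records the same chain-rule operations on a tape and sweeps them backward, which is why one pass yields the gradient with respect to all parameters at once.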

Ben Goodrich

unread,
Jan 11, 2013, 5:29:11 PM1/11/13
to stan-...@googlegroups.com
On Thursday, January 10, 2013 5:27:50 PM UTC-5, Bob Carpenter wrote:
Over the next few months, we'll speed up the multivariate
models by at least an order of magnitude.  Right now,
we're doing all the derivatives by brute-force auto-dif,
but we can be more clever about this and do all the computations
and gradients using double values and then propagate.
We'll also vectorize them, but I doubt that's your issue.

But even so, we're not going to be able to speed up things like
your rescaling in the transformed parameters block, which
explicitly calls a cholesky decomposition.  There may be a better
way to factor that in terms of algebraic operations that have
better auto-dif behavior.

Nothing else jumps out at me that's going to be a huge gain,
but I'm guessing all those Cholesky factorizations are a killer.

Adding a Cholesky matrix as a primitive type would help a lot in this situation and be a lot quicker to implement than various derivative tricks and / or covariance matrix classes.

You can make your own Cholesky matrix in the transformed parameters block out of a vector of parameters as long as you remember to

-- restrict the diagonal to be positive
-- remember the log Jacobian determinant when putting a prior on the covariance matrix

There is a section of the manual, in the Optimizing Stan Code chapter, that combines these two points to get a Wishart prior on the covariance matrix from normal and chi-squared priors on the free elements of the Cholesky matrix.
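[A sketch of this construction in Python/NumPy - my own illustration of the mapping Ben describes, not Stan code: build a lower-triangular L from K*(K+1)/2 unconstrained parameters, exponentiating the diagonal so it stays positive, and form the covariance as L L' with no cholesky_decompose() call needed.]

```python
# Build a Cholesky factor directly from free parameters (sketch of the
# transformed-parameters approach; values and names are hypothetical).

import numpy as np

def cholesky_from_free(theta, K):
    """Map an unconstrained vector of length K*(K+1)/2 to a valid
    lower-triangular Cholesky factor with positive diagonal."""
    L = np.zeros((K, K))
    idx = 0
    for i in range(K):
        for j in range(i + 1):
            if i == j:
                L[i, j] = np.exp(theta[idx])  # keeps the diagonal positive
            else:
                L[i, j] = theta[idx]
            idx += 1
    return L

K = 3
theta = np.array([0.1, -0.3, 0.2, 0.5, -0.1, 0.4])
L = cholesky_from_free(theta, K)
Sigma = L @ L.T   # symmetric positive definite by construction

print(np.allclose(Sigma, Sigma.T))
# The log-Jacobian adjustment for the exp() on the diagonal is the sum
# of the corresponding unconstrained entries: theta[0] + theta[2] + theta[5].
```

Because Sigma is built from L rather than decomposed into it, the autodiff graph avoids differentiating through a Cholesky decomposition, which is the point of Ben's suggestion.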

Ben

terrance savitsky

unread,
Jan 11, 2013, 8:15:48 PM1/11/13
to stan-...@googlegroups.com
ben, is your point that i might be able to speed up the runtimes if i build the cholesky from a vector of parameters, rather than sampling from a matrix on which i perform a cholesky decomposition?  

on a point related to modeling, the model i posted does not appear to be effectively estimable under stan b/c the mixing is poor, so getting the mixing to go is a gate that must be passed through before further addressing speed.  what worries me about the inability to get the samples to mix for this particular model is that it is common for me to have models where the effective number of parameters grows with sample size above some rate (though the model is identified and estimable).   it is my non-technical intuition that the current HMC implementation in stan has a non-negligible probability of choking on such models, due to the possible increase in scale variation with sample size.  my sense is that for most models posted on this list, the effective number of parameters grows at a very low rate with sample size.

terrance

Ben Goodrich

unread,
Jan 11, 2013, 8:23:43 PM1/11/13
to stan-...@googlegroups.com
On Friday, January 11, 2013 8:15:48 PM UTC-5, tds151 wrote:
ben, is your point that i might be able to speed up the runtimes if i build the cholesky from a vector of parameters, rather than sampling from a matrix on which i perform a cholesky decomposition?  

Yes, with less memory also.
 
on a point related to modeling, the model i posted does not appear to be effectively estimable under stan b/c the mixing is poor.  so getting the mixing to go is a gate that must be passed through before further addressing speed. what worries me about the inability to get the samples to mix for this particular model is that it is common for me to have models where the effective number of parameters grows above some rate with sample size (though it is identified and estimable).   it is my non-technical intuition that the current HMC implementation in stan has a non-negligible probability of choking on such models due to the possible increase in scale variation with sample.  it's my sense that most of the effective number of parameters grows at a very low rate with sample size for most models posted on this list.

My first inclination would be to say that the geometry is the main problem, rather than the number of parameters per se. I think Marcus has done models with tens of thousands of parameters. But it is hard to figure out the root of the problem for a model with so many parameters.

Ben

terrance savitsky

unread,
Jan 11, 2013, 8:32:44 PM1/11/13
to stan-...@googlegroups.com
ben, re; geometry, the model has 3 types of parameters:  multivariate random effects, covariance matrices, and ordered cutpoints.    with your coaching, the multivariate random effects and cutpoints were sampled as standard normals and transformed.  the covariance matrices were disaggregated into correlation matrices and scale parameters.  the correlation matrices were sampled as uniform.   then is the problem likely in the correlation matrices?  since i use the cholesky decomposition of the correlation matrices to scale the standard normals for the random effects, do you think that your point about building a cholesky decomposition from a vector of parameters would solve the problem?
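[The scaling step being discussed - rows of standard normals post-multiplied by the transposed Cholesky factor of a correlation matrix and a diagonal scale matrix, mirroring the quoted line `Zeta1 <- Zeta1_norm * transpose(cholesky_decompose(R_z1)) * diag_matrix(s_z1)` - can be checked numerically. A Monte Carlo sketch in Python/NumPy, with made-up values for R and s:]

```python
# Verify the non-centered reparameterization: if rows z ~ N(0, I), then
# delta = z @ L' @ diag(s), with R = L L', has covariance diag(s) R diag(s).

import numpy as np

rng = np.random.default_rng(0)

R = np.array([[1.0, 0.5],
              [0.5, 1.0]])       # target correlation matrix (hypothetical)
s = np.array([2.0, 0.5])         # target scales (hypothetical)
L = np.linalg.cholesky(R)        # lower-triangular factor, R = L L'

z = rng.standard_normal((200_000, 2))   # rows ~ N(0, I)
delta = z @ L.T @ np.diag(s)            # non-centered random effects

target = np.diag(s) @ R @ np.diag(s)
print(np.round(np.cov(delta, rowvar=False), 2))
print(np.round(target, 2))
```

This confirms the transform itself is correct; the mixing question is about the geometry the sampler sees on the standard-normal scale, not the algebra.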


--
 
 

Ben Goodrich

unread,
Jan 12, 2013, 3:55:02 PM1/12/13
to stan-...@googlegroups.com
On Friday, January 11, 2013 8:32:44 PM UTC-5, tds151 wrote:
ben, re; geometry, the model has 3 types of parameters:  multivariate random effects, covariance matrices, and ordered cutpoints.    with your coaching, the multivariate random effects and cutpoints were sampled as standard normals and transformed.  the covariance matrices were disaggregated into correlation matrices and scale parameters.  the correlation matrices were sampled as uniform.   then is the problem likely in the correlation matrices?  since i use the cholesky decomposition of the correlation matrices to scale the standard normals for the random effects, do you think that your point about building a cholesky decomposition from a vector of parameters would solve the problem?

It is hard to speculate about. A uniform prior on a correlation matrix is equivalent to a series of independent but symmetric beta priors on the primitives, which are almost normal. I think that building up an L matrix would help the wall time and the RAM consumption but not the mixing. Really what we need is to see which parameters have dissimilar variances and high correlations. The pairs() function is useful for this, but really only for up to about 10 parameters at a time.

Ben
 

terrance savitsky

unread,
Jan 12, 2013, 5:05:51 PM1/12/13
to stan-...@googlegroups.com
Not to belabor, but a quick additional data point.   The run result I recently reported as demonstrating poor mixing sampled the cutpoints (of an ordered response likelihood) using your approach of generating vectors of standard normals to produce a Dirichlet-distributed simplex vector that is cumsum'd to an ordered vector.  The chains do move under this alternative and demonstrate a move towards convergence, but slowly, with the chains sometimes getting stuck (not moving) for some iterations.  I also ran an alternative which directly specified a simplex cumsum'd to a vector of cutpoints.   That run just concluded and the chains did not move at all; it appears that not a single proposal was accepted.  So, pretty clearly, the approach for generating cutpoints enumerated in the manual is much better - due to your point about getting as close as possible to a standard normal geometry.
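[The construction in question, sketched in Python/NumPy (my own illustration of the manual's trick, with hypothetical sizes): squared norms of standard-normal vectors are chi-squared draws, normalizing them gives a Dirichlet-distributed simplex, and a cumulative sum of the simplex yields a strictly ordered cutpoint vector.]

```python
# Cutpoints from standard normals: normals -> chi-squared -> Dirichlet
# simplex -> ordered vector via cumsum. Mirrors the quoted Stan line
# `gamma1_u[d,c] <- dot_self(z1[d,c])`.

import numpy as np

rng = np.random.default_rng(1)

ncut = 4        # number of cutpoints (hypothetical)
beta = 2        # length of each normal vector; the implied simplex is
                # Dirichlet(beta/2, ..., beta/2)

z = rng.standard_normal((ncut, beta))   # z[c] ~ N(0, I_beta)
gamma_u = (z ** 2).sum(axis=1)          # dot_self(z[c]) ~ chi-squared(beta)
simplex = gamma_u / gamma_u.sum()       # Dirichlet-distributed simplex
cutpoints = np.cumsum(simplex)          # strictly increasing, ends at 1

print(cutpoints)
```

Because the free parameters stay standard normal, the sampler sees a near-isotropic geometry, which is consistent with this variant mixing while the direct simplex parameterization stalled.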

Bob Carpenter

unread,
Jan 13, 2013, 12:56:57 AM1/13/13
to stan-...@googlegroups.com


On 1/12/13 5:05 PM, terrance savitsky wrote:
> Not to belabor,

Please do! We really appreciate the feedback.
As you can see, we're playing around with many of these
ideas ourselves, so don't have definitive answers.

> but a quick additional data point. The run result I recently reported as demonstrating poor mixing
> sampled the cutpoints (of an ordered response likelihood) using your approach of generating vectors of standard normals
> to produce dirichlet-distributed simplex vector cumsum'd to an ordered vector. The chains do move under this
> alternative and demonstrate a move towards convergence, but slowly with the chains sometimes getting stuck (not moving)
> for some iterations. I also ran an alternative which directly specified a simplex cumsum'd to a vector of cutpoints.
> That run just concluded and the chains did not move, at all. it appears that not a single proposal was accepted. So,
> pretty clearly, the use of approach for generating cutpoints enumerated in the manual is much better - due to your point
> about getting as close to a standard normal geometry.

Was it poor convergence or poor mixing after convergence,
or both? We have had issues with our default overdispersed
starting points being a bit too overdispersed for some
regression models, causing very slow adaptation
during warmup due to the time spent in regions of parameter
space far away from areas of high posterior mass.

- Bob