efficiency of stan and mcmc methods in general


Marco

Apr 4, 2013, 10:43:07 PM4/4/13
to stan-...@googlegroups.com
Good evening, everybody.

I feel bad about the size of this post and apologize for it, so I made a short and a long version of the problem.

Either way, I won't feel offended if nobody wants to read it.


-------------------------


Short story:

Something that takes just a few minutes to maximize with an ML/REML-like algorithm is taking me many days to estimate with a Bayesian approach, whether with custom-made software, JAGS, or Stan.

It's a zero-inflated Poisson model with 1e6 observations and about 120 parameters. A version with 1e5 observations also takes too long. It's just the regression, with flat priors, nothing extra. I've seen much bigger and more complex models than that in some articles, so I thought this would be a simple matter to estimate.

Is something like that normal or am I missing something?


-------------------------


Long story:

It has been about 4 months since I started trying to estimate a zero-inflated Poisson (ZIP) regression with, at least for now, flat improper priors. So far, I have tried fitting it with JAGS (using the slow ones/zeros trick), which took 50 s per iteration, so I gave up.

Then I wrote my own code in C++ to do adaptive Metropolis sampling (as proposed by Haario et al.), and I let it run the whole night using all the processor cores (each running one chain) on a filtered part of my database (sample size close to 1e5 and about 120 parameters), and I still didn't get convergence: coda::gelman.diag gives a multivariate PSRF bigger than 20.

I was pretty disappointed by that point, as my professor got a frequentist estimate in Stata in less than 3 minutes with the full dataset of more than 1e6 observations, and I also got one in 10 minutes with R's pscl::zeroinfl using the filtered database of size 1e5. That R package was not able to estimate the full 1e6 database, though, and I don't have Stata to test it on my computer.

So this week I heard about Stan and decided to try it out. I couldn't get it to work well without forcing my beta and gamma parameters to lie in a small range or using an inappropriate prior for them like normal(0,1). But even with those inadequate priors, as the sample size and the number of parameters grow, it just doesn't go anywhere: it gets stuck at the beginning.


And the problem gets bigger, as I intend to (and will probably have to) expand this model in the following ways:
>> Apply zero-inflated negative binomial (ZINB)
>> Compare, via transdimensional MCMC, the ZIP, ZINB, Poisson and Negative Binomial Regressions Models.
>> Do a panel data analysis, with random effects, which would cause it to have thousands of parameters.
>> Do the analysis over a longer period of time: the dataset would then grow to more than 1e7 observations.

So, if the situation is already bad now, then the way it looks, I won't get convergence in those cases unless I wait years. I must be missing something; what would you recommend I do? Will Stan work for this, or should I try another strategy like custom-made software running some other algorithm? How much time do you think is reasonable for it to take to achieve convergence in a problem like this?


Note: in all cases, using the REML estimates as starting points doesn't solve the problem, and it also isn't an option with the full 1e6 dataset, as I can't get an estimate for it with pscl::zeroinfl.

My processor is an Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz, which, I guess, is not that bad, and with 4GB of RAM I had no memory-swapping problems while running the models (though it came close sometimes).


--------------------


Here follows the code of my implementation in Stan, in case it is of interest to anyone (the code should be fully reproducible):

require(rstan)
require(coda)

zi_model_code <- '
data {
    int <lower=0> nrow;
    int <lower=0> ncol;
    int <lower=0> yy[nrow];
    real bbgg[nrow,ncol];
}
parameters {
    real beta0;
    real beta[ncol];
    real gamma0;
    real gamma[ncol];
}
transformed parameters {
    real <lower=0> mu[nrow];
    real <lower=0,upper=1> w[nrow];
    real eta[nrow];
    real zeta[nrow];
    for (i in 1:nrow)
        {
        eta[i] <- beta0;
        zeta[i] <- gamma0;
        for (j in 1:ncol)
            {
            eta[i] <- eta[i] + bbgg[i,j] * beta[j];
            zeta[i] <- zeta[i] + bbgg[i,j] * gamma[j];
            }
        mu[i] <- exp(eta[i]);
        if( zeta[i] > 700 )
            w[i] <- 1;
        else
            w[i] <- exp(zeta[i])/(1+exp(zeta[i]));
        }
}
model {
    for (i in 1:nrow)
        { #full likelihood function is: w[i]*is_equal(y[i],0) + (1-w[i])*poisson(y[i])
        if (yy[i]==0)
            lp__ <- lp__ + log( w[i] + exp(-mu[i])*(1-w[i]) );
        else
            lp__ <- lp__ + yy[i]*eta[i]+log(1-w[i])-mu[i]; //log_factorial(y[i]) was excluded because it is not part of the kernel
        }

//Some priors that speed up the process, but are inadequate =[
#      beta0 ~ normal(0, 1);
#      gamma0 ~ normal(0, 1);
#      for (i in 1:ncol)
#          {
#          beta[i] ~ normal(0, 1);
#          gamma[i] ~ normal(0, 1);
#          }
}
'

#model definition ends here





sampleSize<-1000;
set.seed(10)

#Random zero-inflated poisson generator
rZIP = function(x,mu,w)
{
a=ifelse(runif(x)<w,0,1)
count1=length(which(a==1))
if(length(mu)!=1)
    {
    if (length(mu)!=x) stop("mu must have the size 1 or the same size of x")
    mu <- mu[a==1] #select only values where the first stage got 1
    }
if(length(w)!=1)
    {
    if (length(w)!=x) stop("w must have the size 1 or the same size of x")
    w <- w[a==1] #select only values where the first stage got 1
    }
a[a==1]=rpois(count1,mu)
a
}
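
#Optional sanity check (a sketch, not needed for the fit): the proportion of zeros from rZIP
#should be close to w + (1-w)*exp(-mu); e.g. for mu = 1.2, w = 0.7 that is about 0.79
#mean(rZIP(1e5, 1.2, 0.7) == 0)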

#Generate the yy vector of the response variable and the covariate matrix bbgg
#Zero-inflated models allow different bb and gg matrices, but for my problem bb=gg=bbgg
assign('yy',c(rZIP(sampleSize/2,1.2,0.7),rZIP(sampleSize/2,1.3,0.8)),1)
x1<- c(rep(1,sampleSize/2),rep(0,sampleSize/2))
x2 <- c(rep(1,sampleSize/4),rep(0,sampleSize/4),rep(1,sampleSize/4),rep(0,sampleSize/4))
assign('bbgg', cbind(x1,x2), 1)

#yy, bbgg from my real database
# load("data_prep.Rdata")
# yy<-yy[1:sampleSize]
# bbgg<-bbgg[1:sampleSize,]


zi_model_dat <- list( yy = yy, bbgg = bbgg, nrow=NROW(yy), ncol=NCOL(bbgg) )

#Compile only if modified
if(ifelse(exists("zi_model_code2"),zi_model_code2,"")!=zi_model_code || !exists("modelObj"))
{
    if (exists("modelObj")) rm(modelObj)
    zi_model_code2<-zi_model_code
    modelObj <- stan_model(model_code = zi_model_code)
}

#inits<-list("beta0"=0.1726, "beta[1]"=-0.8719, "beta[2]"=0.3072,"gamma0"=1.042,"gamma[1]"=-1.163,"gamma[2]"=1.463)
print(system.time(
fit <- sampling(modelObj, data = zi_model_dat,
            iter = 10000, chains = 2, pars=c("beta0","beta","gamma0","gamma"), init=0, refresh=1000 )
))


#stan to coda generator
rstanToCoda <- function(object)
{
    require(rstan)
    require(coda)
    object <- as.array(object)
    mcmcTemp <- list()
    for (i in 1:(dim(object)[2]))
        mcmcTemp[[i]] <- mcmc(object[,i,])
    do.call(mcmc.list,mcmcTemp)
}
a<-rstanToCoda(fit)
#a<-a[,varnames(a)!="lp__"]
gelman.diag( a )
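
For reference, the log-likelihood the Stan model above is supposed to implement can also be written directly in R as a sanity check against the simulated data. A sketch (dropping the log(y!) constant, as in the Stan code):

#Sanity-check sketch: ZIP log-likelihood kernel in R, matching the Stan model above
zip_loglik <- function(beta0, beta, gamma0, gamma, yy, bbgg)
{
eta  <- beta0  + as.matrix(bbgg) %*% beta   #log of the Poisson mean
zeta <- gamma0 + as.matrix(bbgg) %*% gamma  #logit of the zero-inflation weight w
mu   <- exp(eta)
w    <- plogis(zeta)                        #stable version of exp(zeta)/(1+exp(zeta))
ll   <- ifelse(yy == 0,
               log(w + (1 - w) * exp(-mu)), #zero part
               log(1 - w) + yy * eta - mu)  #Poisson kernel, log(yy!) dropped
sum(ll)
}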



Ben Goodrich

Apr 4, 2013, 11:18:29 PM4/4/13
to stan-...@googlegroups.com
On Thursday, April 4, 2013 10:43:07 PM UTC-4, Marco Inacio wrote:

In general, you will hurt the wall time with constructions like this that modify the log-posterior at every observation:


model {
    for (i in 1:nrow)
        { #full likelihood function is: w[i]*is_equal(y[i],0) + (1-w[i])*poisson(y[i])
        if (yy[i]==0)
            lp__ <- lp__ + log( w[i] + exp(-mu[i])*(1-w[i]) );
        else
            lp__ <- lp__ + yy[i]*eta[i]+log(1-w[i])-mu[i]; //log_factorial(y[i]) was excluded because it is not part of the kernel
        }
 
Try this instead:

model {
  vector[nrow] summands;
  for(i in 1:nrow) {
    if(yy[i] == 0)
      summands[i] <- log( w[i] + exp(-mu[i])*(1-w[i]) );
    else
      summands[i] <- yy[i]*eta[i]+log(1-w[i])-mu[i];
  }
  lp__ <- lp__ + sum(summands);
}

That will help with the wall time but won't do anything to fix any other problems with the model. But try that and see if you can get enough iterations to achieve convergence, or at least to see better what is preventing convergence.

Ben


Bob Carpenter

Apr 4, 2013, 11:58:57 PM4/4/13
to stan-...@googlegroups.com
Stan's not going to be super fast in its current
incarnation for 1M data points, no matter what you do.
So if you want a regression fit in 3 minutes for
1M data points and 120 parameters, you're probably going
to have to look elsewhere.

But if your model makes sense, Stan should be able to
fit it, even with 1M data points.

I don't understand your model, but have some general
advice.

First, try small, simulated data sets to make
sure the model is correct and getting the right answers.
Then increase data set sizes.

Size itself shouldn't be a problem, but arithmetic issues
can be. Stan doesn't do any kind of automatic ignoring
of illegal or missing values, and if you wind up with
arithmetic overflow/underflow in the wrong places, it will
just reject everything.

Why don't you think normal(0,1) priors are appropriate?
Are you expecting large values for the coefficients?

Are you sure your professor's approach in Stata got a sensible
answer? If so, could you post the Stata model or whatever
it was that was called to do this?

Your model in Stan could be made a lot faster by vectorizing
(and probably with some algebra), but you should first make sure
you're trying to fit the right model.

Are you sure that you don't have values that are overflowing
exp? If you do arithmetic that winds up with not-a-number
values or infinite values, Stan will just reject the samples.
These can often be tamed with a little algebra.

What's the scale of the data? That test for > 700
is worrisome, because exp(x)/(1 + exp(x)) is going to be 1
for values much less than 700 in floating point arithmetic!
This is very worrisome:

> if( zeta[i] > 700 )
> w[i] <- 1;
> else
> w[i] <- exp(zeta[i])/(1+exp(zeta[i]));

And then you use (1 - w[i]), and subtraction is notoriously
bad for arithmetic precision with values of w[i] close to 1
or close to 0.
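
As a quick illustration in R of the kind of thing I mean (the zeta values here are just made up):

#Illustration only: where the naive logistic transform breaks down in double precision
zeta <- c(0, 30, 40, 710)
naive <- exp(zeta) / (1 + exp(zeta))  # NaN at 710 (exp overflows); exactly 1 already at 40
stable <- plogis(zeta)                # R's more careful inverse logit
print(rbind(naive, stable))
print(1 - stable)                     # rounds to 0 once zeta passes about 36, so log(1 - w) becomes -Inf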

- Bob



Marco

Apr 5, 2013, 11:35:12 PM4/5/13
to stan-...@googlegroups.com

 
Try this instead:

model {
  vector[nrow] summands;
  for(i in 1:nrow) {
    if(yy[i] == 0)
      summands[i] <- log( w[i] + exp(-mu[i])*(1-w[i]) );
    else
      summands[i] <- yy[i]*eta[i]+log(1-w[i])-mu[i];
  }
  lp__ <- lp__ + sum(summands);
}

That will help with the wall time but won't do anything to fix any other problems with the model. But try that and see if you can get enough iterations to achieve convergence, or at least to see better what is preventing convergence.

Ben

Hi, thanks very much for the help.

I got the idea about not changing the log-posterior at every observation. But on a related note: I used loops and plain arrays instead of vectors and matrices because I saw in one topic (I don't remember which one) that matrix operations are slower because of sparsity problems. Is that also true for vectors? If so, in this situation, is it quicker to use summands as an array instead of a vector?



Marco

Apr 5, 2013, 11:35:09 PM4/5/13
to stan-...@googlegroups.com

Hi, thank you very very much for your answer.


Stan's not going to be super fast in its current
incarnation for 1M data points, no matter what you do.
So if you want a regression fit in 3 minutes for
1M data points and 120 parameters, you're probably going
to have to look elsewhere.

But if your model makes sense, Stan should be able to
fit it, even with 1M data points.
I'm sorry if I made it look like I was complaining about the software or anything like that; I sincerely admire how you guys made such well-organized and fast software in so little time. I'm more curious, in fact, about how much time I should expect a model this big (1M data points and 120 parameters) to take to converge, or at least come close to it, using Stan and also using MCMC methods in general (Gibbs would probably take a long time since the priors are not conjugate), considering I'm using all 8 cores of the processor to generate 1 chain each. A full day? Some hours? Less than one hour?





I don't understand your model, but have some general
advice.

First, try small, simulated data sets to make
sure the model is correct and getting the right answers.
Then increase data set sizes.
I came to learn this the hard way (not that bad, though) some months ago, so the code I posted uses simulated data.





Size itself shouldn't be a problem, but arithmetic issues
can be.  Stan doesn't do any kind of automatic ignoring
of illegal or missing values, and if you wind up with
arithmetic overflow/underflow in the wrong places, it will
just reject everything.
That might be the problem that is making my model take too long to converge; I'll try to debug this.







Why don't you think normal(0,1) priors are appropriate?
Are you expecting large values for the coefficients?
I just think it doesn't correctly express my uncertainty about the parameters: a priori, 1e2 is as credible as 0 for a parameter value. For example, I got a lot of values in the range between those two numbers for my MLE estimates.





Are you sure your professor's approach in Stata got a sensible
answer?  If so, could you post the stata model or whatever
it was that was called to do this?
The results were pretty similar to those I got using R. I don't have his results with me right now, but it seems he just followed the syntax zip count child camper, inflate(persons) vuong from this tutorial (do links work on this mailing list?).

It seems that Stata uses an algorithm that first fits the constant-only model and then the full model, which might be well optimized for the zero-inflated problem. Also, my professor's version is multicore, if I'm not mistaken, so these two reasons might explain why he got convergence so quickly.





Your model in Stan could be made a lot faster by vectorizing
(and probably with some algebra), but you should first make sure
you're trying to fit the right model.
I read in some post on the mailing list that matrices are slower than loops, so I assumed the same for vectors and wrote everything in terms of loops. But are vectors better than arrays with loops? Or are you referring to something else with the word "vectorizing"?







Are you sure that you don't have values that are overflowing
exp?  If you do arithmetic that winds up with not-a-number
values or infinite values, Stan will just reject the samples.
These can often be tamed with a little algebra.

What's the scale of the data?  That test for > 700
is worrisome, because exp(x)/(1 + exp(x)) is going to be 1
for values much less than 700 in floating point arithmetic!
This is very worrisome:

>          if( zeta[i] > 700 )
>              w[i] <- 1;
>          else
>              w[i] <- exp(zeta[i])/(1+exp(zeta[i]));
That's true: -log(.Machine$double.eps) in R gives me 36.04. I'm ashamed I forgot about this temporary "hack" I left there.





And then you use (1 - w[i]), and subtraction is notoriously
bad for arithmetic precision with values of w[i] close to 1
or close to 0.
That's something I hadn't paid attention to; I'll correct that.

Ben Goodrich

Apr 6, 2013, 1:43:03 AM4/6/13
to stan-...@googlegroups.com

I think the sum of an array and the sum of a vector are essentially the same in terms of speed. Either way, sum() accumulates the derivatives in an efficient manner.

Ben

 

Marco

Apr 6, 2013, 7:41:29 PM4/6/13
to stan-...@googlegroups.com


I think the sum of an array and the sum of a vector are essentially the same in terms of speed. Either way, sum() accumulates the derivatives in an efficient manner.

Ben


Do you know if sum() takes care of floating-point precision using, for example, the Kahan summation algorithm? My code is now highly dependent on something like that.


model {
    real summandsZ1[nZeros];
    real summandsZ2[nZeros];
    real summandsZ3[nZeros];
    real summandsN1[nNonZeros];
    real summandsN2[nNonZeros];
    real summandsN3[nNonZeros];
    int j;
    int k;
    j<-1;
    k<-1;

    for (i in 1:nrow)
        { #likelihood function is: inv_logit(zeta[i])*equals(y[i],0) + (1-inv_logit(zeta[i]))*poisson(y[i])
        if (yy[i]==0)
            {
            summandsZ1[j] <- zeta[i];
            summandsZ2[j] <- log1p_exp(-zeta[i]-mu[i]);
            summandsZ3[j] <- -log1p_exp(zeta[i]);
            j<-j+1;
            }
        else
            {
            summandsN1[k] <- -mu[i];
            summandsN2[k] <- yy[i]*eta[i];
            summandsN3[k] <- -log1p_exp(zeta[i]);
            k<-k+1;
            }
        }

    lp__ <- lp__ + sum(summandsZ1) + sum(summandsZ2) + sum(summandsZ3) + sum(summandsN1) + sum(summandsN2) + sum(summandsN3);

}

Two matrices of summands or 6 vectors like now?
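
For the record, the algebra behind those summands, writing $w_i = \operatorname{logit}^{-1}(\zeta_i)$ and $\mu_i = e^{\eta_i}$, is

$$\log\bigl(w_i + (1-w_i)\,e^{-\mu_i}\bigr)
  = \log\frac{e^{\zeta_i} + e^{-\mu_i}}{1+e^{\zeta_i}}
  = \zeta_i + \log\bigl(1+e^{-\zeta_i-\mu_i}\bigr) - \log\bigl(1+e^{\zeta_i}\bigr)
  \qquad (y_i = 0),$$

$$\log(1-w_i) + y_i\,\eta_i - \mu_i
  = y_i\,\eta_i - \mu_i - \log\bigl(1+e^{\zeta_i}\bigr)
  \qquad (y_i > 0,\ \text{dropping } \log y_i!),$$

so each term can go through log1p_exp.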

Ben Goodrich

Apr 6, 2013, 10:43:33 PM4/6/13
to stan-...@googlegroups.com
On Saturday, April 6, 2013 7:41:29 PM UTC-4, Marco Inacio wrote:

I think the sum of an array and the sum of a vector are essentially the same in terms of speed. Either way, sum() accumulates the derivatives in an efficient manner.

Ben


Do you know if sum() takes care of float precision using, for example, Kahan summation algorithm? My code now is highly dependent on something like that.

No, sum() does not do Kahan summation (although it didn't occur to me until now that such a function would be nice to have), so about the best you can do is evaluate the components with as little loss of precision as you can. The sum() function only makes evaluating the derivatives cheaper.

Ben

John

Apr 7, 2013, 2:53:46 AM4/7/13
to stan-...@googlegroups.com
I wonder if Marco is referring to a larger issue here. Once Marco does finally get a proper Bayesian answer, will it be substantively different from the one that could be obtained in seconds using frequentist software? If not, perhaps he should find ways to exploit the kinds of results Bayesian methods provide, such as full posterior distributions, joint distributions of multiple parameters, etc. However, if he simply wants point estimates and C.I.'s for well-established models, is there really a compelling reason to "go Bayesian"?

I sense some frustration here, though I realize that many would claim that these larger, more philosophical issues should be discussed elsewhere.

John

Marco

Apr 7, 2013, 10:10:44 AM4/7/13
to stan-...@googlegroups.com

> I wonder Marco is referring to a larger issue here. Once Marco does
> finally get a proper Bayesian answer, will it be substantively
> different from the one which could be obtained in seconds using
> frequentist software? If not, perhaps he should find ways to exploit
> the kind of results Bayesian methods provide, such as things related
> to full posterior distributions, joint distributions of multiple
> parameters, etc. However, if he simply desires point estimates and
> C.I.'s for well established models, is there really a compelling
> reason to "go Bayesian"?
>
> I sense some frustration here, though I realize that many would claim
> that these larger, more philosophical issues should be discussed
> elsewhere.
>
> John
I'm in no way frustrated with Bayesian methodology, but looking at my
first email here, I can see I made it sound like that.

It's only been a few months since I started learning and using it, but I
decided not to use frequentist statistics, for philosophical reasons,
for the "power" to change the model, and because I strongly dislike hypothesis
testing (and the C.I.'s you mention). In fact, because of it, I decided to
do my master's in Statistics after finishing my undergraduate degree in
Economics (which explains why I'm sooo n00b about all this stuff).

My frustration was in fact with the speed of the MCMC methods I tried. I knew
drawing a sample from the posterior should take more time than a point
estimate, but taking days to converge on something that takes minutes
for frequentists to estimate was disappointing.

But that's my mistake: until yesterday, I thought that specifying my
likelihood function in a simple form without caring about floating-point
precision, overflow, and underflow was OK because it worked for small,
simulated data. But now I realize that I should have taken full care of
that before complaining.

Anyway, I'm sorry I talked about my views and life here. I'll try to
talk only about technical issues about Stan software from now on.

Ben Goodrich

Apr 7, 2013, 10:50:04 AM4/7/13
to stan-...@googlegroups.com
On Sunday, April 7, 2013 10:10:44 AM UTC-4, Marco Inacio wrote:
But that's my mistake: until yesterday, I thought that specifying my
likelihood function in a simple form without caring about floating-point
precision, overflow, and underflow was OK because it worked for small,
simulated data. But now I realize that I should have taken full care of
that before complaining.

Also, the geometry of the posterior distribution in the unconstrained space.
 
Anyway, I'm sorry I talked about my views and life here. I'll try to
talk only about technical issues about Stan software from now on.

There are many such posts on stan-users (and more on stan-dev). Don't worry about it.

Ben
 

Bob Carpenter

Apr 7, 2013, 1:12:53 PM4/7/13
to stan-...@googlegroups.com


On 4/6/13 7:41 PM, Marco wrote:
>
>>
>> I think the sum of an array and the sum of a vector are essentially the same in terms of speed. Either way, sum()
>> accumulates the derivatives in an efficient manner.

Exactly.

> Do you know if sum() takes care of float precision using, for example, Kahan summation algorithm? My code now is highly
> dependent on something like that.

No, it doesn't. It's just a straight-up sum.
It does the gradients independently, so those
will be OK even if some of the terms in the sum
underflow.

But are you sure you need something like Kahan summation?
It should only be necessary when adding numbers of very
different orders of magnitude, and even then it is only going
to help when you have many of the smaller numbers that
can eventually add up to something that makes a difference
on the larger scale.

You can always code it up yourself in the Stan program
and see if it makes a difference:

http://en.wikipedia.org/wiki/Kahan_summation_algorithm
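
The algorithm itself is only a few lines. Roughly, in R (just a sketch of what you would be translating into the model block):

#Kahan (compensated) summation, sketched in R for reference
kahan_sum <- function(x)
{
s <- 0.0   #running sum
c <- 0.0   #running compensation for lost low-order bits
for (xi in x)
    {
    y <- xi - c
    t <- s + y
    c <- (t - s) - y
    s <- t
    }
s
}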

Warning: it will be much slower because of all the
extra work for derivatives (time is roughly proportional
to the size of expression built up for the log
probability).

We could do this more efficiently by implementing
kahan_sum() and pairwise_sum() functions:

http://en.wikipedia.org/wiki/Pairwise_summation

I'll put it on the to-do list.

- Bob

Bob Carpenter

Apr 8, 2013, 2:27:05 PM4/8/13
to stan-...@googlegroups.com


On 4/5/13 11:35 PM, Marco wrote:
>
> Hi, thank you very very much for your answer.
>
>
>> Stan's not going to be super fast in its current
>> incarnation for 1M data points, no matter what you do.
>> So if you want a regression fit in 3 minutes for
>> 1M data points and 120 parameters, you're probably going
>> to have to look elsewhere.
>>
>> But if your model makes sense, Stan should be able to
>> fit it, even with 1M data points.
> I'm sorry if I made it look like I was complaining about the software or anything like that; I sincerely admire how you
> guys made such well-organized and fast software in so little time. I'm more curious, in fact, about how much time
> I should expect a model this big (1M data points and 120 parameters) to take to converge, or at least come close to it, using Stan and also
> using MCMC methods in general (Gibbs would probably take a long time since the priors are not conjugate), considering
> I'm using all 8 cores of the processor to generate 1 chain each. A full day? Some hours? Less than one hour?

It depends how complicated the log probability
function is. It all comes down to the number of
function applications in computing it. Each of
those functions needs to have its partial derivative
calculated and the chain rule applied, which is where
all the time goes.

To get a feeling, simulate smaller data sets, then increase
the size to measure the trend.
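
For example, something like this in R, assuming yy, bbgg, and a compiled modelObj from a script like yours (just a rough sketch):

#Rough scaling check (sketch): time short runs on increasing subsets of the data
sizes <- c(1e3, 1e4, 1e5)
times <- sapply(sizes, function(n) {
  dat <- list(yy = yy[1:n], bbgg = bbgg[1:n, ], nrow = n, ncol = NCOL(bbgg))
  system.time(sampling(modelObj, data = dat, iter = 200, chains = 1, refresh = 0))["elapsed"]
})
print(data.frame(n = sizes, seconds = times))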

I could fit 1M data point IRT models with 20K parameters
in hours with four chains.

Do you really have 8 cores, or is it hyperthreading? It's
often faster not to use the hyperthreading from Intel.

- Bob



>> Why don't you think normal(0,1) priors are appropriate?
>> Are you expecting large values for the coefficients?

> I just think it doesn't correctly express my uncertainty about the parameters: a priori, 1e2 is as credible as
> 0 for a parameter value. For example, I got a lot of values in the range between those two numbers for my MLE estimates.

Great -- then go with what makes sense. If you expect
larger values, the prior should accommodate them.

An alternative, which can be faster, is to rescale.

>> Are you sure your professor's approach in Stata got a sensible
>> answer? If so, could you post the Stata model or whatever
>> it was that was called to do this?
> The results were pretty similar to those I got using R. I don't have his results with me right now, but it seems
> he just followed the syntax zip count child camper, inflate(persons) vuong from this
> tutorial: http://www.ats.ucla.edu/stat/stata/dae/zip.htm (do links work on this mailing list?).
>
> It seems that Stata uses an algorithm that first fits the constant-only model and then the full model, which might be well
> optimized for the zero-inflated problem. Also, my professor's version is multicore, if I'm not mistaken, so these
> two reasons might explain why he got convergence so quickly.

OK -- if you have values you trust, you can try
initializing the Stan model with those values and see
what happens. That will at least help you diagnose whether
it's a convergence issue. But having said that, maximum
likelihood (or MAP) estimates can be far away from the
majority of the posterior probability mass in some examples.

Multicore only does so much. It'll be at most
N times faster where N is the number of cores.

>> Your model in Stan could be made a lot faster by vectorizing
>> (and probably with some algebra), but you should first make sure
>> you're trying to fit the right model.
> I read in some post on the mailing list that matrices are slower than loops, so I assumed the same for vectors and
> wrote everything in terms of loops. But are vectors better than arrays with loops? Or are you referring to something else
> with the word "vectorizing"?

???

An array of vectors is faster at pulling out the vectors.

Otherwise, use as many matrix operations as you can. Some
don't help much for speed, but none of them are slower than
looping.

- Bob

Marco

Apr 9, 2013, 12:11:51 AM4/9/13
to stan-...@googlegroups.com

> Also, the geometry of the posterior distribution in the unconstrained
> space.
I didn't fully understand what you mean by that. If you know, could you
point me some papers or chapters of books that talk about that?

Marco

Apr 9, 2013, 12:17:41 AM4/9/13
to stan-...@googlegroups.com

It depends how complicated the log probability
function is.  It all comes down to the number of
function applications in computing it.  Each of
those functions needs to have its partial derivative
calculated and the chain rule applied, which is where
all the time goes.

To get a feeling, simulate smaller data sets, then increase
the size to measure the trend.

I could fit 1M data point IRT models with 20K parameters
in hours with four chains.


Thanks for spending your time answering another big email of mine.

Looks like I found the source of the problem: it's the difference in magnitude of the covariates in the model.


For instance by generating my matrix of covariates bbgg in the following way, it works:
sampleSize<-1e5
bbgg <- c(rep(1,sampleSize/4),rep(0,sampleSize/4),rep(2,sampleSize/2))
for(i in 1:3)
{
x1<- c(rnorm(sampleSize/2,i,runif(1,1,5)),rnorm(sampleSize/2,i*1.2,runif(1,1,5)))
x2 <- c(rnorm(sampleSize/4,i*1.1,runif(1,1,5)),rnorm(sampleSize/4,i*1.4*1000,runif(1,1,5)),rnorm(sampleSize/2,i*1.7,runif(1,1,5)))
bbgg <- cbind(bbgg,x1,x2)
}




On the other hand, when there is a big difference in magnitude, it doesn't work: Stan gets stuck at some early iteration, in this case the 17th.
(the only difference here is the extra factor of 1000 in x2)
sampleSize<-1e5
bbgg <- c(rep(1,sampleSize/4),rep(0,sampleSize/4),rep(2,sampleSize/2))
for(i in 1:3)
{
x1<- c(rnorm(sampleSize/2,i,runif(1,1,5)),rnorm(sampleSize/2,i*1.2,runif(1,1,5)))
x2 <- c(rnorm(sampleSize/4,i*1.1*1000,runif(1,1,5)),rnorm(sampleSize/4,i*1.4*1000,runif(1,1,5)),rnorm(sampleSize/2,i*1.7*1000,runif(1,1,5)))
bbgg <- cbind(bbgg,x1,x2)
}


Placing print(lp__) right before the end of the model section shows that Stan is actively calculating log-likelihoods of my model, although it doesn't move to the next iteration. In fact, the log-likelihood seems to be moving in circles.

I know almost nothing about how HMC and NUTS work, so I don't have a good idea of why covariates with different magnitudes are making Stan unable to continue. Do you know a workaround for this problem?



Note:

Making the following changes (and also trying to combine them) doesn't seem to help; at most they change the iteration at which it gets stuck:

>>> Starting with MLE obtained previously.

>>> Using normal priors for all beta and gamma (the other parameters are transformed parameters derived from those two and data)

>>> Constraining all beta and gamma parameters to lie in a certain region.

>>> Using Kahan summation


---------------------------------------

Also, I get strange behavior when I constrain eta and/or zeta (transformed parameters) to certain ranges: when I start sampling, Stan suddenly eats tons of RAM, like 10 or 20 times more than without constraining them. The whole system then gets stuck because of the aggressive swapping, and it takes a long time to quit rstan.

I wonder why this happens.


---------------------------------------

Here follows my model specification (should be fully reproducible code):


require(rstan)
require(coda)

zi_model_code <- '
data
{
    int <lower=0> nrow;
    int <lower=0> ncol;
    int <lower=0> yy[nrow];
    matrix [nrow,ncol] bbgg;
    int <lower=0> nZeros;
}
transformed data
{
    int <lower=0> nNonZeros;
    nNonZeros <- nrow-nZeros;
}
parameters
{
    vector [ncol] beta;
    vector [ncol] gamma;

}
transformed parameters
{
    real <lower=0> mu[nrow];
    real eta[nrow];
    real zeta[nrow];
    for (i in 1:nrow)
    {
        eta[i] <- bbgg[i] * beta;
        zeta[i] <- bbgg[i] * gamma;

        mu[i] <- exp(eta[i]);
    }
}
model {
    real summandsZ1[nZeros];
    real summandsZ2[nZeros];
    real summandsZ3[nZeros];
    real summandsN1[nNonZeros];
    real summandsN2[nNonZeros];
    real summandsN3[nNonZeros];
    int j;
    int k;
    j<-1;
    k<-1;
    for (i in 1:nrow)
    { #log-likelihood function is: log( inv_logit(zeta[i])*equals(y[i],0) + (1-inv_logit(zeta[i]))*poisson(y[i]) )
       #after some algebra, it reduces to:

        if (yy[i]==0)
        {
            summandsZ1[j] <- zeta[i];
            summandsZ2[j] <- log1p_exp(-zeta[i]-mu[i]);
            summandsZ3[j] <- -log1p_exp(zeta[i]);
            j<-j+1;
        }
        else
        {
            summandsN1[k] <- -mu[i];
            summandsN2[k] <- yy[i]*eta[i];
            summandsN3[k] <- -log1p_exp(zeta[i]);
            k<-k+1;
        }
       #the correctness was checked using simulated data
    }

    lp__ <- lp__ + sum(summandsN1) + sum(summandsN2) + sum(summandsN3) + sum(summandsZ1) + sum(summandsZ2) + sum(summandsZ3);


#      for (i in 1:ncol)
#     {
#          beta[i] ~ normal(0, 10);
#          gamma[i] ~ normal(0, 10);
#     }
   
    #print(lp__);
}
'

#model definition ends here


#Compile only if modified
if(ifelse(exists("zi_model_code2"),zi_model_code2,"")!=zi_model_code || !exists("modelObj"))
{
    if (exists("modelObj")) rm(modelObj)
    zi_model_code2<-zi_model_code
    modelObj <- stan_model(model_code = zi_model_code)
}








#Random zero-inflated poisson generator
rZIP = function(x,mu,w)
{
a=ifelse(runif(x)<w,0,1)
count1=length(which(a==1))
if(length(mu)!=1)
    {
    if (length(mu)!=x) stop("mu must have the size 1 or the same size of x")
    mu <- mu[a==1] #select only values where the first stage got 1
    }
if(length(w)!=1)
    {
    if (length(w)!=x) stop("w must have the size 1 or the same size of x")
    w <- w[a==1] #select only values where the first stage got 1
    }
a[a==1]=rpois(count1,mu)
a
}

sampleSize<-1e5
set.seed(10)

#Generate simulated yy vector of regressand and the covariate matrix bbgg
yy <- c(rZIP(sampleSize/4,1.2,0.7),rZIP(sampleSize/4,1.4,0.6),rZIP(sampleSize/2,1.3,0.8))

bbgg <- c(rep(1,sampleSize/4),rep(0,sampleSize/4),rep(2,sampleSize/2))
for(i in 1:3)
{
x1<- c(rnorm(sampleSize/2,i,runif(1,1,5)),rnorm(sampleSize/2,i*1.2,runif(1,1,5)))
x2 <- c(rnorm(sampleSize/4,i*1.1*1000,runif(1,1,5)),rnorm(sampleSize/4,i*1.4*1000,runif(1,1,5)),rnorm(sampleSize/2,i*1.7*1000,runif(1,1,5)))
bbgg <- cbind(bbgg,x1,x2)
}






bbgg <- cbind(1,bbgg)
zi_model_dat <- list( yy = yy, bbgg = bbgg, nrow=NROW(yy), ncol=NCOL(bbgg), nZeros=NROW(which(yy==0)) )



#inits<-list("beta0"=0.1726, "beta[1]"=-0.8719, "beta[2]"=0.3072,"gamma0"=1.042,"gamma[1]"=-1.163,"gamma[2]"=1.463)
#initsV<-list(beta=as.vector(regfBeta),gamma=as.vector(regfGamma))
#for(i in 1:length(regfBeta)) initsV[[paste0("beta[",i,"]")]]<-regfBeta[i]
#for(i in 1:length(regfGamma)) initsV[[paste0("gamma[",i,"]")]]<-regfGamma[i]


###### This is a function to do parallel sampling
#All parameters are the same as rstan::sampling, except nThreads: the number of used cores and silent: if TRUE, suppress stdout of child processes
mcSampling <- function(object, data = list(), pars = NA, ..., nThreads=2, silent = TRUE)
{
if(nThreads<2)
    stop("For only one thread, use rstan::sampling directly")

require(rstan)
require(multicore)
tryCatch(kill(children(),9),  error = function(e) {})
extraArgsList <- list(...)

pids<-list()
for (i in 1:(nThreads-1))
    pids[[i]]<-parallel({ fit<-do.call(function(...){ sampling(object, data, pars, ...) },extraArgsList) }, silent = silent, mc.set.seed = TRUE)

results<-list()
results[[1]]<-sampling(object, data, pars, ...) #One of the chains is run in the current session, so we can output the results
for (i in 1:(nThreads-1))
    results[[i+1]]<-collect(pids[[i]])[[1]]
   
results #return a list of rstan objects
}

print(system.time(fit <- mcSampling(modelObj, data = zi_model_dat, pars=c("beta","gamma"),
            iter = 1000, chains = 1, init=0, refresh=1, nThreads=2, max_treedepth=-1)))













########### Results analysis ################

#stan to coda converter; accepts multiple stanfit objects as arguments, each one containing 1 or more chains
#it can also accept a list of stanfit objects in a single argument
rstanToCoda <- function(...)
{
    require(rstan)
    require(coda)
    input_list <- list(...)
    if(class(input_list[[1]]) == "list") #support for list of stanfit objects
        return (do.call(rstanToCoda,input_list[[1]]))
    mcmcTemp <- list()
    k=1
    j=1
    while(TRUE)
    {
        object <- as.array(input_list[[k]])

        for (i in 1:(dim(object)[2]))
            {
            mcmcTemp[[j]] <- mcmc(as.array(object)[,i,])
            j=j+1
            }
        k=k+1
        if(k>length(input_list)) break;

    }
    do.call(mcmc.list,mcmcTemp)
}
a<-rstanToCoda(fit)
#a<-a[,varnames(a)!="lp__"]
print(gelman.diag( a ))



Ben Goodrich

Apr 9, 2013, 12:47:39 AM4/9/13
to stan-...@googlegroups.com

There is not a whole lot written about it because it seems to be a more acute (or more noticed) issue with HMC than with other MCMC methods. The other email you sent about the covariates having different scales is part of it. There is stuff in the Stan manual about it, particularly the funnel example, which is also accessible in R via stan_demo("funnel_reparam"). Probably the best paper is Mike's

http://arxiv.org/abs/1212.4693

and this will become more manageable once he merges in the Riemannian Manifold HMC.

Ben

Bob Carpenter

Apr 9, 2013, 2:47:46 AM4/9/13
to stan-...@googlegroups.com
The Stan manual touches on it briefly in the discussion of
reparameterization and the funnel example of Radford Neal's.

I'd suggest Radford Neal's MCMC handbook article on HMC for a
start and concentrate on the full mass matrix M. Then
the math gets hairy.

- Bob

Bob Carpenter

Apr 9, 2013, 3:01:57 AM4/9/13
to stan-...@googlegroups.com


On 4/9/13 12:17 AM, Marco wrote:
>
> Looks like I found the source of the problem: it's the difference in magnitude of the covariates in the model.

That's good news.

> For instance by generating my matrix of covariates bbgg in the following way, it works:
> sampleSize<-1e5
> bbgg <- c(rep(1,sampleSize/4),rep(0,sampleSize/4),rep(2,sampleSize/2))
> for(i in 1:3)
> {
> x1<- c(rnorm(sampleSize/2,i,runif(1,1,5)),rnorm(sampleSize/2,i*1.2,runif(1,1,5)))
> x2 <-
> c(rnorm(sampleSize/4,i*1.1,runif(1,1,5)),rnorm(sampleSize/4,i*1.4*1000,runif(1,1,5)),rnorm(sampleSize/2,i*1.7,runif(1,1,5)))
> bbgg <- cbind(bbgg,x1,x2)
> }
>
>
> On the other hand, when there is a big difference in magnitude, it doesn't work: Stan gets stuck at some early iteration,
> in this case the 17th.
> (the only difference here is the extra factor of 1000 in x2)
> sampleSize<-1e5
> bbgg <- c(rep(1,sampleSize/4),rep(0,sampleSize/4),rep(2,sampleSize/2))
> for(i in 1:3)
> {
> x1<- c(rnorm(sampleSize/2,i,runif(1,1,5)),rnorm(sampleSize/2,i*1.2,runif(1,1,5)))
> x2 <-
> c(rnorm(sampleSize/4,i*1.1*1000,runif(1,1,5)),rnorm(sampleSize/4,i*1.4*1000,runif(1,1,5)),rnorm(sampleSize/2,i*1.7*1000,runif(1,1,5)))
> bbgg <- cbind(bbgg,x1,x2)
> }

Can you remind me what these variables are? Is it just
a matter of one set of regression-like coefficients being scaled
to 1000 times the others? This is a HUGE difference on
a transformed scale like the logistic if that's an issue
(I can't recall the details of your model, I'm afraid).

Do you change priors in the model, too? Change of data scale
for predictors should be accompanied by an inverse change
of scale on the priors. Making predictors 1000 times as big
means coefficients are 1/1000 times as big.

> Placing print(lp__) right before the end of the model section shows that Stan is actively calculating log-likelihoods of my
> model, although it doesn't move to the next iteration. In fact, the log-likelihood seems to be moving in circles.
>
> I know almost nothing about how HMC and NUTS work, so I don't have a good idea of why covariates with different
> magnitudes are making Stan unable to continue. Do you know a workaround for this problem?

Regularizing the predictors usually helps :-)

Regression with hugely varying coefficients and floating
point arithmetic don't mix well.

Have you tried init=0 option? That may give you a more stable
place from which to start. We don't have a user controllable
option for scale of initialization noise, and with big
regressions, it can cause problems.

We've been playing around a bit with adaptation, but the
problem with getting stuck early on is that Stan's probably
working very hard to find scales for the proposals that
can let it move.

If it calculates log likelihoods and then the params don't
move, it's because the proposals are getting rejected, which
is probably due to step sizes being too large. You could try
manually setting lower step sizes until you find one that works.
You shouldn't need to tune number of steps -- NUTS will handle
that.

- Bob

Marco

Apr 10, 2013, 11:15:06 PM4/10/13
to stan-...@googlegroups.com

Can you remind me what these variables are?
The model likelihood is the product of a Bernoulli and a Poisson component:
beta holds the regression coefficients for the mean of the Poisson component (with a log link function), and
gamma holds the regression coefficients for the mean of the Bernoulli component (with a logit link function).
The matrix of covariates is the same for both components.
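
Written out, the model is

$$\Pr(y_i = 0) = w_i + (1-w_i)\,e^{-\mu_i}, \qquad
\Pr(y_i = k) = (1-w_i)\,\frac{e^{-\mu_i}\mu_i^{k}}{k!} \quad (k \ge 1),$$

$$\operatorname{logit}(w_i) = x_i'\gamma, \qquad \log \mu_i = x_i'\beta .$$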



Is it just
a matter of one set of regression-like coefficients being scaled
to 1000 times the others?
No, that was only for the simulated data; I did it only to reproduce the error happening on my real data, which hadn't been rescaled at that time.





Do you change priors in the model, too?  Change of data scale
for predictors should be accompanied by an inverse change
of scale on the priors.
Do I need to rescale an improper uniform prior too in this situation? Sorry for the n00b question, but I heard that I should account for a Jacobian here somehow, depending on the rescaling operation.





Regularizing the predictors usually helps :-)
yeeh! It's working pretty well and fast now that I divided all my database columns by their standard deviation.
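
Concretely, I do something like this to the covariate matrix before building the data list (just a sketch; constant columns such as the intercept are left untouched):

#Divide each non-constant column of the covariate matrix by its standard deviation
sds <- apply(bbgg, 2, sd)
bbgg <- sweep(bbgg, 2, ifelse(sds > 0, sds, 1), "/")
#coefficients then refer to the rescaled predictors:
#beta on the original scale = rescaled beta / sd(column)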





Regression with hugely varying coefficients and floating
point arithmetic don't mix well.
You mean I could lose precision and have unreliable results, and/or that this would affect Stan's convergence?





If it calculates log likelihoods and then the params don't
move, it's because the proposals are getting rejected, which
is probably due to step sizes being too large.  You could try
manually setting lower step sizes until you find one that works.
I tried tuning Stan's epsilon, but even a very small change to it either caused it not to move from the initial values at all while accepting everything quickly, or made it get stuck anyway. Only the reparameterization worked.

Bob Carpenter

Apr 11, 2013, 1:04:55 PM4/11/13
to stan-...@googlegroups.com


On 4/10/13 11:15 PM, Marco wrote:
...

>> Do you change priors in the model, too? Change of data scale
>> for predictors should be accompanied by an inverse change
>> of scale on the priors.
> Do I need to rescale an improper uniform prior too in this situation? Sorry for the n00b question, but I heard that I
> should account for a Jacobian here somehow, depending on the rescaling operation.

You only need to adjust the Jacobian for parameters,
not for data.

Rescaling predictors in a regression will affect the behavior
of the prior on the regression coefficients. But if it's
improper uniform, there's no effect. Rescaling by a constant
won't affect the uniform probability.

In general, rescaling itself doesn't need a Jacobian:

log | d/dtheta (c * theta) | = log |c|

which is a constant, so can be dropped from the calculation
in Stan.
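
Spelled out: if you work with a rescaled coefficient $\theta' = \theta / c$, the change of variables gives

$$p_{\theta'}(\theta') = p_{\theta}(c\,\theta')\,\Bigl|\tfrac{d\theta}{d\theta'}\Bigr| = p_{\theta}(c\,\theta')\,|c| ,$$

and with an improper uniform $p_\theta$ that is just another constant.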

>> Regularizing the predictors usually helps :-)
> yeeh! It's working pretty well and fast now that I divided all my database columns by their standard deviation.

Andrew will be happy to hear it, because it's his
standard operating procedure advice.

>> Regression with hugely varying coefficients and floating
>> point arithmetic don't mix well.
> You mean I could lose precision and have unreliable results, and/or that this would affect Stan's convergence?

You lose precision, which makes everything harder to do.
A small amount of it will affect convergence, and if the
variation is too large it will prevent estimation altogether.

- Bob

Marco

Apr 13, 2013, 6:20:48 PM4/13/13
to stan-...@googlegroups.com

>>> Regularizing the predictors usually helps :-)
>> yeeh! It's working pretty well and fast now that I divided all my
>> database columns by their standard deviation.
>
> Andrew will be happy to hear it, because it's his
> standard operating procedure advice.
>
Just a final note: after I added some categorical variables to the problem,
it stopped working again, even after rescaling and re-centering the
other variables.

Then I tried using only proper priors for all parameters and
hierarchical, mutually informative priors for the parameters of the
categorical variables, which was my intention anyway. It didn't work
either: Stan was still getting stuck.

But when I downloaded Stan 1.3 yesterday, surprisingly, it just worked!

It took Stan about 1.6 hours to sample with iter = 100 (50 adaptation + 50
sampling iterations). I haven't checked convergence yet, as it has been only one
small chain so far, since I don't have enough RAM for parallel instances of Stan =[.

Thanks very very much Bob Carpenter and Ben Goodrich for all the help
with my problem, I sure learned a lot!