Meaning of lp__ and model complexity


stan-fan

Jan 31, 2013, 4:55:48 PM1/31/13
to stan-...@googlegroups.com
Hi,

I wonder if the following approach is valid and would like to get your advice.

Let's say I have two models, model 1 with M parameters, and model 2 with M+2 (or M+alpha, alpha>1) parameters.

Let's say I fit a data set with the two models, and the lp__ values (the accumulated log probability; is it equal to the sum of the log likelihood?) of the two models are lp1 and lp2 (lp1 and lp2 are outputs, i.e. distributions, from Stan).

If lp1 is greater than lp2 (or vice versa), can I say one model is a better model than the other one even after model complexity is accounted for?

Thanks for your advice in advance.

Andrew Gelman

Jan 31, 2013, 5:01:03 PM1/31/13
to stan-...@googlegroups.com
No, you can't say that.  This will be made clearer (I hope) in chapter 7 of the new edition of BDA.  For now, you might want to read the literature on AIC.
A

--
You received this message because you are subscribed to the Google Groups "stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

stan-fan

Jan 31, 2013, 5:07:11 PM1/31/13
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Thanks for your quick response.

So, is it correct that lp__ is the sum of the log "likelihood"?
To compare the two models using AIC (and Stan), can I just subtract the number of parameters from the lp__ value?

Andrew Gelman

Jan 31, 2013, 5:12:46 PM1/31/13
to stan-fan, stan-...@googlegroups.com
lp is log posterior, not just likelihood.

stan-fan

Jan 31, 2013, 5:34:43 PM1/31/13
to stan-...@googlegroups.com, stan-fan, gel...@stat.columbia.edu
Thanks for the clarification.

Sorry for more questions, but when is the new edition of Bayesian data analysis coming? Will it use Stan?

Bob Carpenter

Jan 31, 2013, 6:49:02 PM1/31/13
to stan-...@googlegroups.com
On 1/31/13 5:34 PM, stan-fan wrote:

> Sorry for more questions, but when is the new edition of Bayesian data analysis coming?

That's like asking a grad student when they'll finish
their thesis! I do think Andrew's pretty close, but like
theses, pretty close can still take a while to move to finished.
Then publication/printing/distribution takes a while, not
to mention copy editing if they'll have it copy edited.

> Will it use Stan?

Yes, the chapters on BUGS are replaced with chapters
on Stan.

And there's much more info on sampling including HMC
as well as point estimation like variational Bayes
and expectation propagation. And a whole new chapter
on non-parametric Bayes.

- Bob

Bob Carpenter

Jan 31, 2013, 6:51:58 PM1/31/13
to stan-...@googlegroups.com
lp__ is the total log probability accumulator.
What's required is that

p(params|data) propto exp(lp__)

If you use plain old Stan, lp__ winds up being equal
to the joint log probability, log p(params,data), with
any term in the log probability that doesn't involve
a parameter dropped.

If you write your own models manipulating lp__ directly,
you just need to preserve the proportionality. So
it might be log p(params|data) or log p(params,data).

- Bob
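[To see concretely why dropping constant terms preserves the posterior, here is a minimal numeric sketch in Python; the data and priors are made up purely for illustration. Two versions of the joint log probability, with and without additive constants, differ by the same amount at every parameter value, so the posterior they define is identical.]

```python
import math

def normal_lpdf(x, mu, sigma):
    # Log density of Normal(mu, sigma) at x, constants included.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

y = [0.3, -1.2, 0.8]  # made-up data

def lp(mu):
    # Full joint log probability log p(mu, y): Normal(0,1) prior plus likelihood.
    return normal_lpdf(mu, 0.0, 1.0) + sum(normal_lpdf(yi, mu, 1.0) for yi in y)

def lp_no_const(mu):
    # Same model with every term not involving mu dropped, as Stan does.
    return -0.5 * mu ** 2 + sum(-0.5 * (yi - mu) ** 2 for yi in y)

# The two versions differ only by a constant, so differences between any two
# parameter values (and hence the normalized posterior) are unchanged.
assert abs((lp(1.0) - lp(0.0)) - (lp_no_const(1.0) - lp_no_const(0.0))) < 1e-12
```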


stan-fan

Feb 20, 2013, 11:28:34 AM2/20/13
to stan-...@googlegroups.com
Please let me clarify this a bit more. I'm basically trying to understand how to do model comparisons with Stan 1.1.1.

I coded up a toy example: predicting linear data with 3 (nested) models.

1. Linear regression. Y = beta0 + beta1 * X
2. 2nd order polynomial regression. Y = beta0 + beta1 * X + beta2 * X^2
3. 3rd order polynomial regression. Y = beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3

Here is the data for this toy example (Toy_data.pdf): y = 10 + 0.1*x + noise, with noise ~ normal(mean = 0, sd = 1), and N = 100 data points.

I used 1000 burn-in samples and 9000 post-burn-in samples with just 1 chain. I set the seed to a fixed number (= 1) for replication.


1. Linear regression. Y = intercept + slope*X. Here is the actual Stan code. The R code is also attached (toyExample_linear.R)

model {
  # prior
  beta0 ~ normal(0, 1.0E12);   # normal prior w/ mean = 0, SD = 1.0E12
  beta1 ~ normal(0, 1.0E12);   # normal prior w/ mean = 0, SD = 1.0E12
  sigma ~ uniform(0, 10);         # uniform prior for sigma

  for (i in 1:N) {
    y[i] ~ normal( beta0 + beta1 * x[i], sigma);      # Y_n = beta0 + beta1 * Xn + e_n where e_n ~Normal(0, sigma)
  }
}

Here is its output. lp__ is -40.4.

       mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
beta0  10.1       0 0.2   9.8  10.0  10.1  10.2  10.5  2443    1
beta1   0.1       0 0.0   0.1   0.1   0.1   0.1   0.1  2520    1
sigma   0.9       0 0.1   0.8   0.9   0.9   1.0   1.1  2841    1
lp__  -40.4       0 1.3 -43.6 -41.0 -40.0 -39.5 -39.0  2417    1


2. This is the 2nd model. Y = intercept + slope*X + beta2*X^2 (toyExample_2poly.R)

model {
  # prior
  beta0 ~ normal(0, 1.0E12);
  beta1 ~ normal(0, 1.0E12);
  beta2 ~ normal(0, 1.0E12);
  sigma ~ uniform(0, 10);

  for (i in 1:N) {
    y[i] ~ normal( beta0 + beta1 * x[i] + beta2 * pow(x[i], 2), sigma);   # Y_n = beta0 + beta1 * Xn + beta2 * Xn^2 + e_n where e_n ~Normal(0, sigma)
  }
}
             
Here is its output. lp__ is -40.9.

       mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
beta0  10.1       0 0.3   9.5   9.9  10.1  10.3  10.6   876    1
beta1   0.1       0 0.0   0.1   0.1   0.1   0.1   0.1   723    1
beta2   0.0       0 0.0   0.0   0.0   0.0   0.0   0.0   735    1
sigma   0.9       0 0.1   0.8   0.9   0.9   1.0   1.1   378    1
lp__  -40.9       0 1.4 -44.3 -41.7 -40.6 -39.8 -39.1  1063    1


3. This is the 3rd model.  Y = intercept + slope*X + beta2 * X^2 + beta3 * X^3 (toyExample_3poly.R)

model {
  # prior
  beta0 ~ normal(0, 1.0E12);
  beta1 ~ normal(0, 1.0E12);
  beta2 ~ normal(0, 1.0E12);
  beta3 ~ normal(0, 1.0E12);
  sigma ~ uniform(0, 10);

  for (i in 1:N) {
    y[i] ~ normal( beta0 + beta1 * x[i] + beta2 * pow(x[i], 2) + beta3 * pow(x[i], 3), sigma); 
  }
}


Here is its output. lp__ is -41.2.
       mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
beta0  10.1     0.0 0.4   9.4   9.8  10.2  10.4  10.9    86    1
beta1   0.1     0.0 0.0   0.0   0.1   0.1   0.1   0.2    90    1
beta2   0.0     0.0 0.0   0.0   0.0   0.0   0.0   0.0    91    1
beta3   0.0     0.0 0.0   0.0   0.0   0.0   0.0   0.0    92    1
sigma   0.9     0.0 0.1   0.8   0.9   0.9   1.0   1.1    26    1
lp__  -41.2     0.1 1.5 -45.0 -42.0 -40.9 -40.1 -39.2   393    1


So,

1. Is it correct that the lp__ value "penalizes" the complexity of the functions (and if so, by how much?), but not necessarily the number of parameters?

2. Do you suggest using AIC (rather than BIC or another adjustment) to account for the number of parameters? I'm aware of the various assumptions underlying AIC vs. BIC.

3. I noted that n_eff is smaller for more complex models (presumably because of correlations between parameters) and wonder whether it can be related to lp__.


Stan is (really) awesome, but I would like to use it to properly compare various models as well. I wish I could read Chapter 7 of the new BDA now, but any other useful papers/chapters on this topic would be appreciated. Thanks a lot for your help in advance.


I'm having trouble uploading small files, so let me upload a zip file containing all the code and figures.
toyData.zip

Andrew Gelman

Feb 20, 2013, 4:17:19 PM2/20/13
to stan-...@googlegroups.com
Hi, just a few comments:
1.  Never just run 1 chain.  We suggest running 4 chains.
2.  Don't do normal(0,10^12) priors.  Use something weak, if you want, but keep it computationally stable.  For example, you might try normal(0,10) or normal(0,100).  (Obviously it will depend on the scale of the problem, but that's the basic idea.)
3.  I'd recommend AIC or DIC, not BIC.  DIC isn't perfect but it's often ok. 
I hope this helps!
AG

<toyData.zip>

Bob Carpenter

Feb 21, 2013, 11:37:23 AM2/21/13
to stan-...@googlegroups.com
More inline below.

On 2/20/13 4:17 PM, Andrew Gelman wrote:
> Hi, just a few comments:
> 1. Never just run 1 chain. We suggest running 4 chains.
> 2. Don't do normal(0,10^12) priors. Use something weak, if you want, but keep it computationally stable. For example,
> you might try normal(0,10) or normal(0,100). (Obviously it will depend on the scale of the probelm but that's the basic
> idea).

Stan will also work with improper priors as long as the
posteriors are proper.

> 3. I'd recommend AIC or DIC, not BIC. DIC isn't perfect but it's often ok.

In order to compute these things, you'll need to store
the log likelihoods somewhere. Stan isn't really set up
to make that easy, as it doesn't know the likelihood from
the prior. So it may require some redundant calculations in
the generated quantities block. For instance,

data {
  int<lower=0> N;
  real y[N];
}
parameters {
  real mu;
}
model {
  mu ~ normal(0,1);  // prior
  y ~ normal(mu,1);  // likelihood
}
generated quantities {
  real log_likelihood;
  log_likelihood <- normal_log(y,mu,1);
}

But the real problem is that they also need the behavior of the
model at the mean of the samples, which isn't so easy to code
in Stan. (We've been thinking of adding accumulators like running
averages to make this possible within Stan, but don't have any
concrete plans.)
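[As a rough illustration of what that computation looks like once the per-draw log likelihoods are saved: a Python sketch with made-up data and stand-in posterior draws rather than real Stan output. DIC is the deviance at the posterior mean plus twice the effective number of parameters.]

```python
import math

def normal_lpdf(x, mu, sigma):
    # Log density of Normal(mu, sigma) at x.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

y = [0.3, -1.2, 0.8]                     # made-up data
mu_draws = [-0.2, 0.1, 0.0, -0.1, 0.05]  # stand-in posterior draws of mu

def log_lik(mu):
    # Log likelihood of the data under Normal(mu, 1): what the
    # generated quantities block above would save for each draw.
    return sum(normal_lpdf(yi, mu, 1.0) for yi in y)

dbar = sum(-2.0 * log_lik(m) for m in mu_draws) / len(mu_draws)  # mean deviance
mu_bar = sum(mu_draws) / len(mu_draws)   # posterior mean of mu
dhat = -2.0 * log_lik(mu_bar)            # deviance at the posterior mean

p_d = dbar - dhat                        # effective number of parameters
dic = dhat + 2.0 * p_d                   # equivalently dbar + p_d
```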
lp__ is just the log probability function as defined
by the model (and the log Jacobian determinants of any
implicit transforms).

When you write:

y ~ normal(0,1);

Stan translates this to:

lp__ <- lp__ + normal_log(y,0,1);

All Stan models do is define the log probability function because
that's all HMC needs.

>> 2. Do you suggest using AIC (rather than BIC or other adjustment) to account for the number of parameters? I'm aware
>> of various assumptions underlying AIC vs. BIC.

>> 3. I noted that n_eff is smaller for more complex models (presumably because of correlations between parameters) and
>> wonder if it can be related to lp__..


Not sure what you mean by this.

>> Stan is (really) awesome but I would like to use it to properly compare various models as well. I wish I could read
>> the Chap 7 of the new BDA now but any other useful papers/chapters related to this topic would be appreciated. Thanks
>> a lot for your help in advance.

It should be out in a few months.

You'll notice that Andrew doesn't do a lot of traditional
model comparison using stats like BIC. There is a discussion
of model checking in the 2nd edition of BDA and also in
his regression book with Jennifer.

I should add something to the manual discussing these
issues. At various times, we've thought of adding flags to
the model statements to classify them as being in the prior
or in the likelihood.

>> I'm having trouble uploading small files, so let me upload a zip file containing all the codes/figure.

The problem is suffixes. The newsgroup doesn't like suffixes such
as .stan or .R that it's never heard of. If you append .txt after
each, they'll go through.

- Bob

Hang Chen

May 15, 2014, 11:20:06 AM5/15/14
to stan-...@googlegroups.com
Hi,
I am new to Stan and would really appreciate it if anyone could help me with a related lp__ issue.

Specifically, when I tried to run a basic stochastic frontier model, I got a positive lp__. (I got the warning message below as well:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Location parameter is 1.#INF:0, but must be finite.)

Does anyone know what it means to have a positive lp__? Is there any way to resolve this issue, such as setting a lower bound for the sigma parameter in normal(.,.)?

Thanks a lot!!

Bob Carpenter

May 15, 2014, 11:23:50 AM5/15/14
to stan-...@googlegroups.com
lp__ is just the value of the log of the total (parameters plus data)
probability density function, less any constant normalizers. The
sampling statements and increment_log_prob() statements both add to
lp__.

It is not restricted to be negative.
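[A quick numeric check of this point, in Python for illustration: lp__ is a log density, and a continuous density can exceed 1, so its log can be positive.]

```python
import math

def normal_lpdf(x, mu, sigma):
    # Log density of Normal(mu, sigma) at x.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# A density is not a probability: with a small enough scale, the density
# at the mode exceeds 1, so its log is positive.
lp = normal_lpdf(0.0, 0.0, 0.1)  # density here is about 3.99, log about 1.38
assert lp > 0
```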

If you want help, please send your model and let us know
how you're initializing.

The problem you're having is with location, not with scale.

- Bob
