model comparison when WAIC and LOO give opposite results


Bruno Nicenboim

Nov 23, 2015, 3:52:48 AM
to Stan users mailing list

Hi
I'm a bit lost about what to do when WAIC and LOO give opposite results with real data. I'll explain further.

I'm using RStan and I'm comparing two models. One is a shifted lognormal distribution, basically:

...
lognormal_log(y - psi, mu, sigma)

and the other is a mixture of that distribution and a uniform distribution, which should act as a contamination distribution:
 
...
 
P_dist[1] ~ beta(0.1, 0.9);  // prior on the contamination weight

for (i in 1:N_obs) {
  real distrib[2];
  // contamination component: uniform over the observed range of rt
  distrib[1] <- log(P_dist[1]) + uniform_log(rt[i], range_rt[1], range_rt[2]);
  // main component: shifted lognormal
  distrib[2] <- log(P_dist[2]) + shifted_lognormal_log(rt[i], mu[i], sigma, psi[i]);
  increment_log_prob(log_sum_exp(distrib));
}
where shifted_lognormal_log is just a function that does what I showed above.
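For concreteness, a minimal R sketch of the log density such a function would compute (the actual Stan function isn't shown in the thread, so treat this as an assumption):

# Hypothetical R equivalent of shifted_lognormal_log: the log density of a
# lognormal shifted right by psi (only defined for y > psi).
shifted_lognormal_log <- function(y, mu, sigma, psi) {
  dlnorm(y - psi, meanlog = mu, sdlog = sigma, log = TRUE)
}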

Both work relatively well after I increased the tree depth (the intercept of the distributions has an Rhat of 1.02, but since the model recovers the parameters I didn't have the patience to run the chains for longer). I tried the models on simulated data and they recover the parameters quite well. If I generate data from the contaminated distribution and use the compare function from the loo package, both waic and loo agree that the second model has a weight of 1.

Now, when I compare the two models on a real dataset, I get this:

> compare(loo_sl,loo_sl_mixt)
elpd_diff        se   weight1   weight2
     -6.6       8.2       1.0       0.0
> compare(waic_sl,waic_sl_mixt)
elpd_diff        se   weight1   weight2
      2.5       7.0       0.1       0.9


From Vehtari et al. (2015), I understand that "for influential observations WAIC underestimates the effect of leaving out one observation", and that LOO "relies on inference from a smaller subset of the data being close to inference from the full dataset". My objective is to figure out whether this uniform distribution is good enough to capture the outliers, or whether I must try something else. How can I check which of the compare results is telling me what I want to know?

Thanks!
(By the way, I was pretty sure there was a tag for model comparison, but it's no longer there (or maybe it never was); I hope my question is still relevant to the mailing list.)

Bruno

Aki Vehtari

Nov 23, 2015, 7:04:36 AM
to Stan users mailing list
From Vehtari et al. (2015), I understand that "for influential observations WAIC underestimates the effect of leaving out one observation", and that LOO "relies on inference from a smaller subset of the data being close to inference from the full dataset". My objective is to figure out whether this uniform distribution is good enough to capture the outliers, or whether I must try something else. How can I check which of the compare results is telling me what I want to know?


Do you have influential observations? What are the khat values for each observation? Have you tried making scatter plots of the pointwise loo and waic results (comparing the two models, and comparing PSIS-LOO vs WAIC)?
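A minimal sketch of such scatter plots in R, assuming the loo/waic objects from this thread and the loo package of that time, which stored the pointwise contributions in a pointwise matrix:

# Sketch: pointwise elpd contributions (element and column names assumed
# from the 2015-era loo package).
elpd_loo_sl   <- loo_sl$pointwise[, "elpd_loo"]
elpd_loo_mixt <- loo_sl_mixt$pointwise[, "elpd_loo"]
elpd_waic_sl  <- waic_sl$pointwise[, "elpd_waic"]

# model vs model (pointwise PSIS-LOO)
plot(elpd_loo_sl, elpd_loo_mixt); abline(0, 1)
# method vs method for the same model (PSIS-LOO vs WAIC)
plot(elpd_loo_sl, elpd_waic_sl); abline(0, 1)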

Aki

Jonah

Nov 23, 2015, 12:49:19 PM
to Stan users mailing list
Hi Bruno, to follow up on Aki's suggestions: since you're using the loo R package, the khats (the estimates of the shape parameter of the generalized Pareto distribution fitted for each observation) as well as the pointwise contributions of each observation can be found in the object returned when you call the loo function. For waic there are no khat estimates, but the pointwise contributions are there.

You can also plot the khats using the plot.loo method.
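For example (a sketch assuming the loo interface of that time, with the khats stored in a pareto_k element):

khat <- loo_sl$pareto_k        # one khat per observation (assumed element name)
sum(khat > 0.5); sum(khat > 1) # the counts reported in the warnings
which(khat > 1)                # indices of the most problematic observations
plot(loo_sl)                   # plot.loo method: khat against observation index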

Jonah

Bruno Nicenboim

Nov 23, 2015, 1:09:31 PM
to stan-...@googlegroups.com
Thanks Jonah and Aki,

So it seems I do have influential observations. And the model that's supposed to take care of the outliers (the second one, which includes the mixture distribution) seems to have more influential observations (more k estimates greater than 1, at least).

> loo_sl
Computed from 3000 by 3782 log-likelihood matrix

         Estimate    SE
elpd_loo -24016.6  83.9
p_loo       213.5   9.6
looic     48033.3 167.9
Warning messages:
1: 42 (1.1%) Pareto k estimates between 0.5 and 1 
2: 2 (0.1%) Pareto k estimates greater than 1 
3: See PSIS-LOO description (?'loo-package') for more information

> loo_slwo0
Computed from 3000 by 3782 log-likelihood matrix

         Estimate    SE
elpd_loo -24023.2  84.7
p_loo       232.6  12.6
looic     48046.5 169.4
Warning messages:
1: 31 (0.8%) Pareto k estimates between 0.5 and 1 
2: 15 (0.4%) Pareto k estimates greater than 1 
3: See PSIS-LOO description (?'loo-package') for more information 


I'm not so sure what to look for in the plots. If I plot the khats, I have a bunch of them above 0.5 and above 1 (which I also see in the summary).

I plotted the pointwise loo and waic for the real data and the simulated data (the names for the complex model, the one with the mixture, end with wo0). I'm not sure if this is what Aki was asking for. I noticed that the plots for real and simulated data don't look that different, but for the simulated data I get a clear preference for the mixture model with both loo and waic:


> compare(loo_sim_sl,loo_sim_slwo0)
elpd_diff        se   weight1   weight2 
     37.0      14.6       0.0       1.0 
> compare(waic_sim_sl,waic_sim_slwo0)
elpd_diff        se   weight1   weight2 
     37.8      14.6       0.0       1.0 


And I have more influential observations, but all of them are between 0.5 and 1. (By the way, the simulated dataset is smaller, about half the size of the real one.)

> loo_sim_sl
Computed from 3000 by 800 log-likelihood matrix

         Estimate    SE
elpd_loo  -5305.1  53.0
p_loo        59.8   3.6
looic     10610.3 106.0
Warning messages:
1: 26 (3.2%) Pareto k estimates between 0.5 and 1 
2: See PSIS-LOO description (?'loo-package') for more information 
> loo_sim_slwo0
Computed from 3000 by 800 log-likelihood matrix

         Estimate   SE
elpd_loo  -5268.1 48.9
p_loo        67.2  3.4
looic     10536.3 97.7
Warning messages:
1: 25 (3.1%) Pareto k estimates between 0.5 and 1 
2: See PSIS-LOO description (?'loo-package') for more information 




--
Bests,

Bruno Nicenboim
pointwise_real_data
pointwise_simulated_data

Aki Vehtari

Nov 24, 2015, 6:57:01 AM
to Stan users mailing list
This many khats > 1 means trouble: it means you can't trust IS-LOO or WAIC, and you need to use k-fold-CV or change your model to be more robust. I know you had a uniform part for the contamination, but I don't think a uniform over some interval is necessarily a good choice, as you get a discontinuity in the density. How about making the second component a wider lognormal distribution?

The fact that you don't have khats > 1 with the simulated data shows that your real data is more surprising. You could check the observed values with khat > 1.

A more complex model may have larger khats, since these influential observations can change the posterior more (making the importance-sampling leave-one-out approximation more difficult). The uniform is likely to be problematic, too.

A diagonal line in the plots would help, but you can see that most of the points behave nicely, and you can see how the mixture model changes the values; you can also see the group of problematic points.

Aki

Bruno Nicenboim

Nov 24, 2015, 10:42:19 AM
to stan-...@googlegroups.com
Thanks! A couple of comments and questions inline:

On Tue, Nov 24, 2015 at 12:57 PM, Aki Vehtari <aki.v...@aalto.fi> wrote:
This many khats > 1 means trouble: it means you can't trust IS-LOO or WAIC, and you need to use k-fold-CV
K-fold-CV involves fitting each model at least 10 times in Stan, right? I think I understood it well enough to implement it, but I would like to avoid doing so. (There's no R package that does it the way loo does, right?)
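For reference, a hypothetical 10-fold-CV driver in R might look like the sketch below. It assumes a Stan program "model.stan" that takes a holdout indicator and computes log_lik for every observation in generated quantities; none of these names come from the thread.

library(rstan)

K <- 10
N <- length(rt)
folds <- sample(rep(1:K, length.out = N))
elpd_pointwise <- numeric(N)
for (k in 1:K) {
  # holdout marks the fold to be excluded from the likelihood in model.stan
  standata <- list(N = N, rt = rt, holdout = as.integer(folds == k))
  fit <- stan("model.stan", data = standata)
  log_lik <- extract(fit, pars = "log_lik")$log_lik  # iterations x N matrix
  idx <- which(folds == k)
  # elpd for each held-out point: log of the posterior-mean likelihood
  # (a numerically stabler version would use log-sum-exp)
  elpd_pointwise[idx] <- apply(log_lik[, idx, drop = FALSE], 2,
                               function(l) log(mean(exp(l))))
}
elpd_kfold <- sum(elpd_pointwise)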
 
or change your model to be more robust. I know you had a uniform part for the contamination, but I don't think a uniform over some interval is necessarily a good choice, as you get a discontinuity in the density. How about making the second component a wider lognormal distribution?
I'm trying that right now, but all the papers on reaction times (I'm fitting reading times) that deal with outliers this way use a uniform for the contamination distribution. With a wide lognormal I have to decide on its location and scale, since I don't think I should derive them from the data. Have you seen another paper that uses that? Maybe I'm limited by looking only at psychology/psycholinguistics papers with reaction times. Or do you have any recommendation to help me decide on the shape and location?


The fact that you don't have khats > 1 with the simulated data shows that your real data is more surprising. You could check the observed values with khat > 1.
They are relatively fast observations, but I don't notice anything particularly strange about them.


A more complex model may have larger khats, since these influential observations can change the posterior more (making the importance-sampling leave-one-out approximation more difficult). The uniform is likely to be problematic, too.

In any case, the objective of the mixture model was to get rid of the observations that aren't part of the shifted lognormal distribution I'm interested in. Since I have more influential observations with this mixture model than with the simpler one, can I conclude that the mixture model is not doing what it should?
 
Thanks again
Bruno



--
Bests,

Bruno Nicenboim

Krzysztof Sakrejda

Nov 24, 2015, 11:03:30 AM
to stan-...@googlegroups.com


On Tuesday, November 24, 2015 at 10:42:19 AM UTC-5, Bruno Nicenboim wrote:
Thanks! A couple of comments and questions inline:

On Tue, Nov 24, 2015 at 12:57 PM, Aki Vehtari <aki.v...@aalto.fi> wrote:
This many khats > 1 means trouble: it means you can't trust IS-LOO or WAIC, and you need to use k-fold-CV
K-fold-CV involves fitting each model at least 10 times in Stan, right? I think I understood it well enough to implement it, but I would like to avoid doing so. (There's no R package that does it the way loo does, right?)

or change your model to be more robust. I know you had a uniform part for the contamination, but I don't think a uniform over some interval is necessarily a good choice, as you get a discontinuity in the density. How about making the second component a wider lognormal distribution?
I'm trying that right now, but all the papers on reaction times (I'm fitting reading times) that deal with outliers this way use a uniform for the contamination distribution.

You are (likely) seeing what Aki mentioned: the lognormal + uniform mixture has a kink in it, so the uniform takes care of high outliers but does not necessarily take care of medium-high outliers. I'm attaching a plot that displays this issue for a Weibull-uniform mixture. Black dots have k < 1 and colored dots (by group) have k > 1. The factors "delivery_index" and "province_index" are just grouping variables. In my case this doesn't matter much, since I only need to predict where the bulk of the distribution is, but your mileage may vary (and I assume the parameter estimates are a little off).
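To see the kink concretely, here is a minimal R illustration (all parameter values are made up): the lognormal + uniform mixture density drops discontinuously at the uniform's upper bound, whereas a lognormal + wide-lognormal mixture stays smooth.

y <- seq(0.05, 5, by = 0.01)
p <- 0.05  # contamination weight (made up)
mix_unif <- (1 - p) * dlnorm(y, 0, 0.5) + p * dunif(y, 0, 3)
mix_ln   <- (1 - p) * dlnorm(y, 0, 0.5) + p * dlnorm(y, 0, 2)
plot(y, log(mix_unif), type = "l")  # kink: abrupt drop at y = 3
lines(y, log(mix_ln), lty = 2)      # smooth everywhere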

Krzysztof
like-this.png

Aki Vehtari

Nov 24, 2015, 3:26:00 PM
to Stan users mailing list
On Tuesday, November 24, 2015 at 6:03:30 PM UTC+2, Krzysztof Sakrejda wrote:
You are (likely) seeing what Aki mentioned: the lognormal + uniform mixture has a kink in it, so the uniform takes care of high outliers but does not necessarily take care of medium-high outliers.

Excellent comment! The observations close to the boundary between "non-outlier" and "outlier" are problematic: if they are not outliers they strongly influence the parameters of the lognormal, but if they are outliers they don't, and thus for observations in this region the posterior and the leave-one-out posterior may differ a lot. Since the effect is so strong, it seems that the lognormal is not a good choice here either.

Aki

Bruno Nicenboim

Nov 24, 2015, 3:39:30 PM
to stan-...@googlegroups.com
On Tue, Nov 24, 2015 at 9:26 PM, Aki Vehtari <aki.v...@aalto.fi> wrote:
On Tuesday, November 24, 2015 at 6:03:30 PM UTC+2, Krzysztof Sakrejda wrote:
You are (likely) seeing what Aki mentioned: the lognormal + uniform mixture has a kink in it, so the uniform takes care of high outliers but does not necessarily take care of medium-high outliers.

Excellent comment! The observations close to the boundary between "non-outlier" and "outlier" are problematic: if they are not outliers they strongly influence the parameters of the lognormal, but if they are outliers they don't, and thus for observations in this region the posterior and the leave-one-out posterior may differ a lot. Since the effect is so strong, it seems that the lognormal is not a good choice here either.

Aki
 
I understand. So either the uniform is not the ideal distribution to capture the outliers, or the shifted lognormal is not the ideal distribution for the "good" observations (or both).
But if I think the distribution of "good" observations should still be a shifted lognormal (for theoretical reasons), then, sorry to be repetitive, I just can't use WAIC or LOO to tell whether the uniform mixture gives any better fit than using nothing at all. Is that right?



--
Bests,

Bruno Nicenboim

Aki Vehtari

Nov 25, 2015, 2:54:44 AM
to Stan users mailing list
On Tuesday, November 24, 2015 at 10:39:30 PM UTC+2, Bruno Nicenboim wrote:
I understand. So either the uniform is not the ideal distribution to capture the outliers, or the shifted lognormal is not the ideal distribution for the "good" observations (or both).
But if I think the distribution of "good" observations should still be a shifted lognormal (for theoretical reasons), then, sorry to be repetitive, I just can't use WAIC or LOO to tell whether the uniform mixture gives any better fit than using nothing at all. Is that right?

Let's be careful with the acronyms: you can use exact LOO and trust that (or k-fold-CV, which is faster). For this specific data and model, the approximate performance estimates WAIC and PSIS-LOO (and IS-LOO and TIS-LOO) should not be trusted. For some other data (like your simulated data), they seem to be fine.

Our paper http://arxiv.org/abs/1507.04544 has Stan code example for k-fold-CV, too.

Even if there are theoretical reasons, the observation process may contain other sources of error. Your comparison with the simulated data strongly hints that the theoretical reasons do not hold for the real data.

Aki

Bruno Nicenboim

Nov 25, 2015, 10:11:20 AM
to stan-...@googlegroups.com
Thanks!! That was very useful. I'll look into k-fold-CV and alternative ways to model the data.

Bests,
Bruno

Krzysztof Sakrejda

Nov 25, 2015, 10:19:09 AM
to Stan users mailing list


On Tuesday, November 24, 2015 at 3:26:00 PM UTC-5, Aki Vehtari wrote:
On Tuesday, November 24, 2015 at 6:03:30 PM UTC+2, Krzysztof Sakrejda wrote:
You are (likely) seeing what Aki mentioned: the lognormal + uniform mixture has a kink in it, so the uniform takes care of high outliers but does not necessarily take care of medium-high outliers.

Excellent comment! The observations close to the boundary between "non-outlier" and "outlier" are problematic: if they are not outliers they strongly influence the parameters of the lognormal, but if they are outliers they don't, and thus for observations in this region the posterior and the leave-one-out posterior may differ a lot. Since the effect is so strong, it seems that the lognormal is not a good choice here either.

Thanks Aki, I recently read your paper and found it very helpful, but I was surprised by how little you play up the model-criticism angle: after all, even if the PSIS-LOO estimates are too noisy to be useful, they still point out which data points are problematic for the model. I used to have to do quite a bit more work to get that, and the loo package wraps it up in a really nice metric.

That said, I don't think I have a complete understanding of how the Pareto k estimates should be used, since in my own model (the png I posted in a comment above; I don't have it with me right now) the smattering of very-high-k points was mixed in among less problematic points. So, for models with many k > 1 values, I think of the khats as a general indicator of the problem area in covariate space, rather than a distinct indicator that specific points are surprising. I'll ask a separate question on the list (once I sort out what it is).

Krzysztof

Aki Vehtari

Nov 25, 2015, 12:40:15 PM
to Stan users mailing list

On Wednesday, November 25, 2015 at 5:19:09 PM UTC+2, Krzysztof Sakrejda wrote:
That said, I don't think I have a complete understanding of how the Pareto k estimates should be used, since in my own model (the png I posted in a comment above; I don't have it with me right now) the smattering of very-high-k points was mixed in among less problematic points. So, for models with many k > 1 values, I think of the khats as a general indicator of the problem area in covariate space, rather than a distinct indicator that specific points are surprising. I'll ask a separate question on the list (once I sort out what it is).

Note also that the khat estimate is itself noisy: if fewer samples are used, or if there are problems in the sampling, the khat is more unreliable. Bruno did mention in the first message that "Both work relatively well after I increased the tree depth (the intercept of the distributions has an Rhat of 1.02, but since the model recovers the parameters I didn't have the patience to run the chains for longer)", so it seems there were some problems with the sampling. Even if the sampling goes fine, the variance of the khat estimate increases when the true k is large, so if you highlight only khat > 1, you may have those points surrounded by points with khat < 1 but still relatively high khat. This variation is reflected in the variance of the PSIS estimate as well. Increasing the sample size will reduce the uncertainty in khat and might produce a less noisy view of the problematic region in your data. Recently I've been observing that it might be useful to flag khat > 0.7 (instead of k > 1) as problematic.
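A quick way to apply that cutoff with the loo objects in this thread (a sketch; it again assumes the khats sit in a pareto_k element, as in the loo versions of that time):

khat <- loo_sl_mixt$pareto_k
table(cut(khat, c(-Inf, 0.5, 0.7, 1, Inf)))  # observations per khat band
flagged <- which(khat > 0.7)                 # candidates for closer inspection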
 
Aki