PSIS-LOO for trial-by-trial modeling


Lei Zhang

Dec 18, 2015, 9:22:59 AM
to Stan users mailing list
Hi Stan,

I've been modeling a 2-armed bandit task with a basic reinforcement learning model in a trial-by-trial fashion. I have 129 subjects who each performed the same task for 100 trials, and I ran 4000 samples overall in Stan. I've implemented 'loglik' in the generated quantities section, which gives me a 4000 by 129 log-likelihood matrix. For such a trial-by-trial model, we normally do not consider the log-likelihood per subject per trial; rather, we sum the loglik over all trials and get the loglik per subject only, because the trials are not fully independent according to the model, i.e. the expected value on the current trial is updated from the last trial.

When passing the two model fits to the loo() function, I obtained a huge proportion of subjects whose k-hat is larger than 0.5 or even 1.0 (see below). Please note that each pointwise k-hat represents a subject rather than a single data point. For even more complex models, even more subjects have k-hat greater than 1.

(1) I wonder whether I am doing this correctly. Is model comparison based on PSIS-LOO still reliable?
(2) I also read in the paper (Vehtari et al., 2015) that I could consider computing k-fold CV instead; however, as this is not implemented in the {loo} package, I have no clear idea how to do it for our trial-by-trial model. Am I supposed to run a k(-trial)-fold CV per subject, or a k(-subject)-fold CV over all subjects? Any suggestions on how to compute a k-fold CV for such a reinforcement learning model?
(3) Or, alternatively, how about using WAIC?

You may want to check the attached Stan model file. Any input or suggestions are welcome.

> loo_RL
Computed from 4000 by 129 log-likelihood matrix

         Estimate    SE
elpd_loo  -6920.1 150.0
p_loo       172.6  13.0
looic     13840.2 299.9
Warning messages:
1: 110 (85.3%) Pareto k estimates between 0.5 and 1
2: 19 (14.7%) Pareto k estimates greater than 1

> loo_RLnc
Computed from 4000 by 129 log-likelihood matrix

         Estimate    SE
elpd_loo  -6562.5 166.6
p_loo       230.7   8.9
looic     13125.1 333.2
Warning messages:
1: 74 (57.4%) Pareto k estimates between 0.5 and 1
2: 55 (42.6%) Pareto k estimates greater than 1

> compare(loo_RL, loo_RLnc)
elpd_diff        se   weight1   weight2
    357.6      42.3       0.0       1.0


Thanks in advance!

Lei


 
RevLearn_RL.stan

Aki Vehtari

Dec 18, 2015, 10:21:30 AM
to Stan users mailing list


On Friday, December 18, 2015 at 4:22:59 PM UTC+2, Lei Zhang wrote:
Hi Stan,

I've been modeling a 2-armed bandit task with a basic reinforcement learning model in a trial-by-trial fashion. I have 129 subjects who each performed the same task for 100 trials, and I ran 4000 samples overall in Stan. I've implemented 'loglik' in the generated quantities section, which gives me a 4000 by 129 log-likelihood matrix. For such a trial-by-trial model, we normally do not consider the log-likelihood per subject per trial; rather, we sum the loglik over all trials and get the loglik per subject only, because the trials are not fully independent according to the model, i.e. the expected value on the current trial is updated from the last trial.

It is sensible to consider the leave-one-subject-out performance as you describe. I often make such performance estimates and they can give different results than leave-one-observation-out.
 

When passing the two model fits to the loo() function, I obtained a huge proportion of subjects whose k-hat is larger than 0.5 or even 1.0 (see below). Please note that each pointwise k-hat represents a subject rather than a single data point. For even more complex models, even more subjects have k-hat greater than 1.

(1) I wonder whether I am doing this correctly. Is model comparison based on PSIS-LOO still reliable?

Based on a quick look, it seems your code is correct. Large k-hats tell you that PSIS-LOO is not reliable. PSIS (and TIS, IS, and WAIC) works when the leave-one-out distribution is close to the full posterior. When you leave 100 observations out at once, it is likely that the distribution changes too much for importance sampling to work well. Figures 5 and 6 in our paper, Pareto Smoothed Importance Sampling (http://arxiv.org/abs/1507.02646), show how the performance of PSIS (and IS and TIS) gets worse as the number of dimensions (in your case, the number of left-out observations) increases.

 
(2) I also read in the paper (Vehtari et al., 2015) that I could consider computing k-fold CV instead; however, as this is not implemented in the {loo} package, I have no clear idea how to do it for our trial-by-trial model. Am I supposed to run a k(-trial)-fold CV per subject, or a k(-subject)-fold CV over all subjects? Any suggestions on how to compute a k-fold CV for such a reinforcement learning model?

I recommend k-fold CV where the data are divided by subject (e.g., with k=10, leave 12-13 subjects out in each fold). Vehtari et al. (2015) has example code. If you can compute a separate log_lik for each subject (summing over the 100 trials), as in your current code, then you can do the same for the test data in k-fold CV. We could add some helper functions for dividing the data into training and test sets (I have them in Matlab, but they need to be ported to R and Python, too).
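
In R, the fitting loop could look roughly like the following. This is only a sketch: it assumes the Stan program takes a per-subject holdout indicator, skips the likelihood contribution of held-out subjects in the model block, and still computes the per-subject log_lik (summed over trials) in generated quantities for everyone. The file and variable names (RevLearn_RL_kfold.stan, standata, holdout) are made up for illustration, not taken from the attached model.

library(rstan)

K <- 10
N <- 129                              # number of subjects
fold <- rep_len(1:K, N)               # fold membership per subject (randomization is discussed later in the thread)

log_lik_kfold <- matrix(NA, 4000, N)  # draws x subjects, filled in fold by fold

for (k in 1:K) {
  holdout <- as.integer(fold == k)                     # 1 = test subject, 0 = training subject
  data_k  <- c(standata, list(holdout = holdout))      # standata: the full data list (assumed)
  fit_k   <- stan(file = "RevLearn_RL_kfold.stan",     # hypothetical variant of the model
                  data = data_k, iter = 2000, chains = 4)
  ll_k    <- extract(fit_k, pars = "log_lik")$log_lik  # 4000 x 129 per-subject log_lik
  log_lik_kfold[, fold == k] <- ll_k[, fold == k]      # keep only the held-out subjects
}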
 
(3) Or, alternatively, how about using WAIC?

If PSIS-LOO fails, then WAIC fails, too. Currently our WAIC code just doesn't warn you about that. Adding warnings to our WAIC code is on the TODO list.

Aki

Lei Zhang

Dec 18, 2015, 11:23:39 AM
to Stan users mailing list
Thanks, Aki, for your clear and comprehensive comments and suggestions! I now have two more questions about implementing the k-fold CV. I apologize if these questions seem too naive.

(1) Partitioning the data. If I run a 10-fold CV, that gives 12-13 subjects per fold. Do I select the subjects for each fold randomly or systematically? And do I run the model fitting only K times, or do I use M different data partitions and fit the model M x K times?

(2) Obtaining the k-fold-CV output. Suppose I run a 10-fold CV, selecting subjects systematically (i.e., the first 13, the second 13, and so on), and obtain ten log-likelihood matrices, one for each test fold. What do I do next to summarize the results and compare different models?
As I understood the paper, once I have the 10 loglik matrices, then e.g. for a single 4000-by-13 loglik matrix L, I sum L over the S simulation draws and divide the result by S, i.e. rowSum(L)/S; then I sum these results across the 10 folds. That is then the k-fold-CV result for model comparison. Am I correct here? And does it also make sense to multiply it by -2 to translate it onto the information criterion scale?

many thanks again!
Lei

Sebastian Weber

Dec 18, 2015, 11:35:31 AM
to Stan users mailing list
Hi!

Apologies for straying slightly, but this happens so often: suppose I want to do k-fold CV on the basis of subjects, as in the example described by Lei; in his case I understood that each subject has exactly 100 trials. However, I often face the situation that subjects have vastly different numbers of observations. Is a leave-subjects-out approach then meaningful, or should each subject on average have the same number of observations?

Such regular data sets occur so rarely in my applications...

Thanks (Aki)!

Sebastian



Aki Vehtari

Dec 18, 2015, 1:22:34 PM
to Stan users mailing list

On Friday, December 18, 2015 at 6:23:39 PM UTC+2, Lei Zhang wrote:
Thanks, Aki, for your clear and comprehensive comments and suggestions! I now have two more questions about implementing the k-fold CV. I apologize if these questions seem too naive.

(1) Partitioning the data. If I run a 10-fold CV, that gives 12-13 subjects per fold. Do I select the subjects for each fold randomly or systematically? And do I run the model fitting only K times, or do I use M different data partitions and fit the model M x K times?

If the subjects are exchangeable, that is, the order does not contain information, then there is no need for random selection. If the order does contain information, e.g. in survival studies where later patients have shorter follow-up, then randomization is often useful. I would randomly permute the subjects and then systematically divide them into 10 groups. I would do only one random permutation and systematic division, and run the model fitting 10 times. Making several permutations can reduce variance, but the effect is usually negligible for stable inference. Other approaches for selecting subjects randomly have higher variance than the procedure I suggested above. In some cases it may be useful to use stratification to get more balanced groups.
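
In R, this permute-then-divide step could be as simple as the following (an illustrative sketch, not code from the loo package):

set.seed(1)                       # any fixed seed, just for reproducibility
N <- 129; K <- 10
fold <- sample(rep_len(1:K, N))   # balanced systematic division, randomly permuted across subjects
table(fold)                       # nine folds of 13 subjects and one fold of 12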

 
(2) Obtaining the k-fold-CV output. Suppose I run a 10-fold CV, selecting subjects systematically (i.e., the first 13, the second 13, and so on), and obtain ten log-likelihood matrices, one for each test fold. What do I do next to summarize the results and compare different models?
As I understood the paper, once I have the 10 loglik matrices, then e.g. for a single 4000-by-13 loglik matrix L, I sum L over the S simulation draws and divide the result by S, i.e. rowSum(L)/S; then I sum these results across the 10 folds. That is then the k-fold-CV result for model comparison. Am I correct here?

I would compute the loglik for each subject and combine the results from the different runs to get a 4000-by-129 loglik matrix, as in LOO. Instead of the loo function, you can compute the elpd as sum(log(mean(exp(log_lik)))), where the mean is taken over the draws and the sum over the subjects. You can use the pointwise contributions (before the sum) as in LOO. Maybe we should make a helper function which would create, from the k-fold-CV log_liks, an object that could be given to the compare function. I have examples in Matlab, but unfortunately I'm still learning R.
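
In R this could be written as follows (a sketch; log_lik_kfold is the assembled 4000-by-129 matrix, and log(mean(exp(x))) is computed in a numerically stable form):

log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))   # stable log(mean(exp(x)))

elpd_pointwise <- apply(log_lik_kfold, 2, log_mean_exp)   # one value per subject
elpd_kfcv      <- sum(elpd_pointwise)
se_elpd_kfcv   <- sqrt(length(elpd_pointwise)) * sd(elpd_pointwise)

# Comparison of two models from their pointwise values, analogous to the compare() output above
diff      <- elpd_pointwise_RLnc - elpd_pointwise_RL      # both assumed computed as above
elpd_diff <- sum(diff)
se_diff   <- sqrt(length(diff)) * sd(diff)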

 
And does it also make sense to multiply it by -2 to translate it onto the information criterion scale?


I hate multiplying by -2. Without it, the difference between lpd and elpd_loo is related to the effective number of parameters. I also prefer that a larger value is better.

Aki

Aki Vehtari

Dec 18, 2015, 1:27:16 PM
to Stan users mailing list

On Friday, December 18, 2015 at 6:35:31 PM UTC+2, Sebastian Weber wrote:
Hi!

Apologies for straying slightly, but this happens so often: suppose I want to do k-fold CV on the basis of subjects, as in the example described by Lei; in his case I understood that each subject has exactly 100 trials. However, I often face the situation that subjects have vastly different numbers of observations. Is a leave-subjects-out approach then meaningful, or should each subject on average have the same number of observations?

The leave-subjects-out approach is meaningful in this case. If you want each subject to have the same weight in the predictive performance estimate, then you can divide the log_lik for each subject by the number of observations for that subject.
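
As an illustration of one way to read that in R, dividing each subject's pointwise contribution by that subject's number of trials when forming the summary (variable names are assumptions; note that this rescaling is applied to the summary, not to the log_lik fed into PSIS):

# n_obs[i] = number of observations for subject i; elpd_pointwise as computed from the log_lik matrix
elpd_per_obs      <- elpd_pointwise / n_obs   # average log predictive density per observation
elpd_equal_weight <- sum(elpd_per_obs)        # each subject now contributes with equal weight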

Aki

Lei Zhang

Dec 19, 2015, 5:09:16 AM
to Stan users mailing list
Thanks again Aki, your explanation is indeed helpful!

Just one more concern: when computing the elpd as you described, is there also a way to calculate the estimated effective number of parameters, 'p_loo', as in the loo() function? Could I still derive p_loo using the variance-based calculation from the 4000-by-129 loglik matrix? If so, that would be really great.

BTW, would you mind sharing the Matlab code for k-fold CV that you mentioned? Is it one of the .m files in your GitHub repository?

Many thanks.
Lei

Aki Vehtari

Dec 19, 2015, 12:44:38 PM
to Stan users mailing list
On Saturday, December 19, 2015 at 12:09:16 PM UTC+2, Lei Zhang wrote:
Thanks again Aki, your explanation is indeed helpful!

Just one more concern: when computing the elpd as you described, is there also a way to calculate the estimated effective number of parameters, 'p_loo', as in the loo() function? Could I still derive p_loo using the variance-based calculation from the 4000-by-129 loglik matrix? If so, that would be really great.

Do you really need an estimate of the effective number of parameters? Some bias correction terms in the information criterion literature have been called the effective number of parameters, but 1) you don't need that because you already have the predictive performance estimate, and 2) the interpretation of the bias correction, or of the difference between lpd and elpd, as the effective number of parameters is not clear (e.g., in the 8 schools example p_waic is always <= 4, although there are 9 parameters in the model, and the difference between lpd and elpd can be > 9). You can still compute the difference between lpd and elpd_kfcv and use it as some measure of the difference between within-sample and out-of-sample performance, but interpreting it as the effective number of parameters is tricky.
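
If you do want to report that difference, it can be computed directly from the two log-likelihood matrices; a sketch in R (log_lik_full from the fit to all data, log_lik_kfold assembled from the folds, log_mean_exp as in the earlier sketch):

lpd       <- sum(apply(log_lik_full,  2, log_mean_exp))   # within-sample log predictive density
elpd_kfcv <- sum(apply(log_lik_kfold, 2, log_mean_exp))   # out-of-sample estimate from k-fold CV
p_kfcv    <- lpd - elpd_kfcv                              # bias-correction term; interpret with the caveats above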
 

BTW, would you mind sharing the Matlab code for k-fold CV that you mentioned? Is it one of the .m files in your GitHub repository?

You can find one part of it in GPstuff: https://github.com/gpstuff-dev/gpstuff/blob/master/misc/cvit.m
In the next few days I can add a full example with Stan to the PSIS repo, or somewhere...

Aki

Lei Zhang

Dec 21, 2015, 9:00:00 AM
to Stan users mailing list
Thanks again, Aki. I agree that it becomes tricky to interpret the estimated effective number of parameters. However, when reporting LOO, PSIS-LOO, or k-fold CV, given that they are rather new methods for model comparison, I wonder whether it is sensible to also keep the results from AIC, BIC, DIC, etc. For consistency, I might report the bias correction alongside.

Any comments on the above point, Aki? And other experts' opinions?

Last but not least, thanks (Aki) for putting effort into the Matlab code. This will be a great contribution to the community.

Best,
Lei

Avraham Adler

Dec 21, 2015, 10:20:28 AM
to stan-...@googlegroups.com
On Friday, December 18, 2015 at 1:22:34 PM UTC-5, Aki Vehtari wrote:
I hate multiplying by -2. Without it, the difference between lpd and elpd_loo is related to the effective number of parameters. I also prefer that a larger value is better.

Aki


To be fair, the difference is still directly proportional to the effective number of parameters, just multiplied by -2. If I recall correctly, the -2 comes from the linear regression world, where the residual sum of squares (or its log, I forget) is twice the negative log-likelihood and is thus related to the deviance and approximately chi-square. Burnham and Anderson discuss it in passing in one of their web papers on AIC myths, but actual text on the original reason for the -2 is admittedly hard to pin down.

Avi

Aki Vehtari

Dec 21, 2015, 10:33:52 AM
to Stan users mailing list


On Monday, December 21, 2015 at 4:00:00 PM UTC+2, Lei Zhang wrote:
Thanks again, Aki. I agree that it becomes tricky to interpret the estimated effective number of parameters. However, when reporting LOO, PSIS-LOO, or k-fold CV, given that they are rather new methods for model comparison, I wonder whether it is sensible to also keep the results from AIC, BIC, DIC, etc. For consistency, I might report the bias correction alongside.

If it is tricky, then maybe it should not be reported by default. I'm worried that people over-interpret the meaning of the effective number of parameters (as they have been doing for years). It can be useful for getting some insight into the complexity of the model (or into problems in the computation).

The oldest references to cross-validation predate AIC, and Bayesian LOO and k-fold CV were proposed by Geisser in 1975, so LOO and k-fold CV are rather old. IS-LOO is from 1992, and the only new method is PSIS-LOO, which is just a more reliable and accurate way to compute IS-LOO estimates. BIC computes something completely different and should not be listed with the other ICs. If you are using Bayesian models, then you should also forget AIC and DIC.


Last but not least, thanks (Aki) for putting effort into the Matlab code. This will be a great contribution to the community.

A k-fold-CV demo with MatlabStan is now at https://github.com/avehtari/PSIS/tree/master/m
Demos for R and Python will appear later.

Aki

Lei Zhang

Dec 21, 2015, 10:42:02 AM
to Stan users mailing list
Thanks a lot! I have gained a lot from the discussion in this thread, both conceptually and computationally. I guess I can mark this discussion as 'completed'.

Lei

jacquie...@gmail.com

Feb 23, 2016, 6:44:11 AM
to Stan users mailing list
Hi all,

Related to these questions, I was wondering whether I should also divide the log_lik for each subject by the number of observations for that subject if I want to use PSIS-LOO to compare models. I have 29 subjects, and they each have between 190 and 200 observations on a psychological learning task, which means that if I compute for each subject the log-likelihood as the sum of the log_lik across all observations, the ones with more trials will have a higher log_lik.

When I put these log_lik into psisloo (Matlab), most k's are > 1. If, however, I first divide the log_lik by the number of observations per subject, then no k's are larger than 0.02, but some are negative (~-0.1). Does this mean that I can then compare models based on PSIS-LOO?

Many thanks
Jacquie

Andrew Gelman

Feb 23, 2016, 6:23:00 PM
to stan-...@googlegroups.com
Hi, you don't want to be dividing by the number of observations. There are more general questions of what to do if you care more about some observations than others. In theory one could define a weighted LOO or WAIC that counts some observations more than others. I have not seen this, but I could easily imagine settings where it would make sense, as a sort of poststratification.
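
As a purely hypothetical sketch of such a weighted summary in R (w_raw would be user-supplied poststratification weights and elpd_pointwise the pointwise contributions from loo; nothing like this is implemented in the loo package):

w <- w_raw / sum(w_raw) * length(elpd_pointwise)   # rescale weights to sum to the number of points
elpd_loo_weighted <- sum(w * elpd_pointwise)       # weighted total instead of the plain sum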

Regarding k: you don't want to divide by the number of observations. So if k is greater than 1, this does indicate high variance.
A



Jonah Gabry

Feb 23, 2016, 7:04:20 PM
to Stan users mailing list
As an aside, negative k estimates are not a problem. The shape parameter of the generalized Pareto distribution can be any real number and only positive values (above 1/2, and above 1 in particular) are problematic. 

Bob Carpenter

Feb 23, 2016, 7:36:02 PM
to stan-...@googlegroups.com
Weighting categories equally is not uncommon in search and
natural language classifier evaluations, which are a kind
of model comparison. It's known as macro-averaging:

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html

It ignores the population distribution and treats each category
equally, so has exactly the opposite effect to what Andrew was
suggesting with poststratification. Macro-averaging inflates
the importance of each item from a small category tremendously.
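
A toy R illustration of the difference, with made-up counts for three categories of very different sizes:

correct <- c(big = 900, medium = 40, small = 5)    # correctly classified items per category
totals  <- c(big = 1000, medium = 50, small = 10)  # total items per category
micro_avg <- sum(correct) / sum(totals)   # pooled over items, ~0.89; dominated by the big category
macro_avg <- mean(correct / totals)       # categories weighted equally, ~0.73; dragged down by the small one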

You also often get training data that has undergone stratified
sampling, with an equal number of items for each category even
though that doesn't represent the actual distribution of items.

- Bob

Andrew Gelman

Feb 23, 2016, 7:51:42 PM
to stan-...@googlegroups.com
Bob:
Your suggestion is not opposite to poststratification; it's a special case of poststratification where you adjust to a distribution in which all categories are weighted equally.
A

Bob Carpenter

Feb 23, 2016, 8:39:59 PM
to stan-...@googlegroups.com
Absolutely. But the effect is the opposite of the usual
poststratifications I've seen, the point of which has been
to take some kind of biased data sample and adjust it for
a real population of interest (e.g., voters in the Xbox
paper).

With macro averaged evaluations, you typically have an unbiased
random sample from the population of interest, and then you
poststratify to something that weights each category equally and
doesn't represent any population of interest (except maybe to
consumers of classifier evaluations involving macro averaging).

- Bob

Andrew Gelman

Feb 23, 2016, 9:20:23 PM
to stan-...@googlegroups.com
It's similar to age adjustment. It's mathematically the same as poststrat, but I agree that the application is different.

Bob Carpenter

Feb 23, 2016, 11:46:00 PM
to stan-...@googlegroups.com
Yes, the math is exactly the same.

Speaking of which, we still really need to figure out what
to do with ragged R-hat and n_eff where we have N chains
of varying sizes and want to combine them. To make the
problem more interesting, the chains might even come from
different samplers with different autocorrelation rates.

Sorry for hijacking this thread!

- Bob