Leave-one-group-out CV


Jonathan Steinhart

Nov 7, 2016, 10:34:10 AM
to R-inla discussion group
Hello,

I am a newcomer to INLA (which I'm grateful to have discovered exists!), and am attempting to use it to model some biological time series data. There are S different subjects, from each of whom we have one time series (which, incidentally, occur at different and irregular time points). I'm modeling them using some fixed effects for covariates like age and weight, random per-subject (iid) intercepts, and one or more latent factors (e.g. rw1) replicated by subject to account for the per-subject trajectories.

The goal is to select the model which best characterizes the variability in yet-unseen subjects without overfitting. To do this, what (I think) I'd like to do is compute the out-of-sample log predictive density for each subject s in 1, ..., S in turn, always conditioning on the data for all subjects except s (or maybe to do this in folds, since there are lots of data and the model fitting is pretty slow). My problem is that I can't figure out how to achieve this given the output and functions that I've found thus far. 

I would also be satisfied with other means of assessing out-of-sample performance at the group/subject/replicate level. (I know that the CPO/PIT values give insight into leave-one-observation-out performance, but it seems that this doesn't generalize to the group level?)

Does anyone have any advice on whether and how this might be possible? I apologize in advance if this question is based on misguided premises, or if the answer is somehow obvious (or both!)

Many thanks,
Jon

INLA help

Nov 7, 2016, 3:00:55 PM
to Jonathan Steinhart, R-inla discussion group
On Mon, 2016-11-07 at 07:34 -0800, Jonathan Steinhart wrote:
> Hello,
>
> I am a newcomer to INLA (which I'm grateful to have discovered
> exists!), and am attempting to use it to model some biological time
> series data. There are S different subjects, from each of whom we
> have one time series (which, incidentally, occur at different and
> irregular time points). I'm modeling them using  some fixed effects
> for covariates like age and weight, random per-subject (iid)
> intercepts, and one or more latent factors (e.g. rw1) replicated by
> subject to account for the per-subject trajectories. 
>
> The goal is to select the model which best characterizes the
> variability in yet-unseen subjects without overfitting. To do this,
> what (I think) I'd like to do is compute the out-of-sample log
> predictive density for each subject s in 1, ..., S in turn, always
> conditioning on the data for all subjects except s (or maybe to do
> this in folds, since there are lots of data and the model fitting is
> pretty slow). My problem is that I can't figure out how to achieve
> this given the output and functions that I've found thus far. 


Hi,

thanks for your email. to check out-of-sample performance on groups of
data, you have to call inla several times. each time, you can zero out
some of the data by setting them to NA, and then compare the observed
data with the predicted ones. so you have to create a loop and call
inla several times.

you can improve the speed by starting at the previous fit, like

fit = NULL
for (split in 1:10) {
    if (is.null(fit)) {
        ## first split: fit from scratch
        fit = inla(..., control.predictor = list(compute = TRUE))
    } else {
        ## later splits: restart from the previous fit's mode
        fit = inla(..., control.mode = list(result = fit, restart = TRUE),
                   control.predictor = list(compute = TRUE))
    }
    ## compare predictions with the held-out observations
    ...
}


hope this helps,
H

--
Håvard Rue
he...@r-inla.org

Haakon Bakka

Nov 7, 2016, 3:31:02 PM
to INLA help, Jonathan Steinhart, R-inla discussion group
I would recommend running inla many times, and setting subsets to NA as Håvard says. The following is how I make this run fast.

I run the entire model once and use that as control.mode in all the runs "fitting the reduced datasets".

I use the following when I need to do what you do:
inla.call="submit"

To use this you must set up the "remote" use of inla. See the FAQ, under
"I have access to a remote Linux server, is it possible to run the computations remotely and running R locally?"

The nice thing now is that your server can run all the different inla runs in parallel: perfect parallelisation!

And then you use inla.qget afterwards, to retrieve the jobs from the server.

If you are able to set up the use of inla.call="remote", and want to see code for how I use this, let me know!

Haakon




--
You received this message because you are subscribed to the Google Groups "R-inla discussion group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to r-inla-discussion-group+unsub...@googlegroups.com.
To post to this group, send an email to r-inla-discussion-group@googlegroups.com.

Jonathan Steinhart

Nov 8, 2016, 10:15:33 AM
to R-inla discussion group
Dear Håvard and Haakon,

Thanks very much for your advice on how to do the cross-validation by making observations NA -- I have implemented the k-fold calculation of mean log predictive density using this method and it seems to be working as desired. Also, thanks for the pointer to the control.mode option, which definitely speeds up subsequent runs after the first one. I read several times through all of the docs for inla and other core functions to familiarize myself with all of these options, but somehow I had missed this very useful one.

Regarding running on a remote server: I've been running on the remote server via RStudio Server, but batch submission via ssh would be more convenient. One question: the server has some outdated system libraries, which necessitated using the statically-linked INLA binary, enabled via inla.setOption(inla.call = paste(d,"/inla.static", sep="")). If I follow the instructions in the FAQ for setting up SSH job submission, will it "just work", or is there something special I must do to make it use the statically-linked binary? (Not wanting to go down this rabbit hole was the only thing keeping me from doing this already! :)

Thanks again for your help.

-Jon



Haakon Bakka

Nov 9, 2016, 11:58:35 AM
to Jonathan Steinhart, R-inla discussion group
I do not know. It should not be too much work to just try it out (if you have linux/mac). Håvard may know.

I would be curious to see your code for "comparing the held-out data to the predictive distribution", as I am not sure which approach you have chosen. How did you add the observation likelihood? Did you only look at marginal predictions, or did you also do the joint prediction?

Haakon



Haakon Bakka

Nov 14, 2016, 5:39:11 AM
to Jonathan Steinhart, R-inla discussion group
For your "remote" question:

in .inlarc (which you set up when you want to use remote), you must set the correct path to the INLA binary. In your case, you have modified this path.

It might be your "paste(d,"/inla.static", sep="")"?

Haakon

Jonathan Steinhart

Nov 16, 2016, 8:48:30 AM
to R-inla discussion group
Haakon,

Sorry about the delayed reply. My code is messy and spread across several files/classes, so I'll describe the approach; please let me know if you'd like to see any actual code and I'll try to put together a more readable standalone example. The thing I have tried to compute (hopefully correctly; I welcome any comments/criticisms!) is the log pointwise predictive density as described in section 2.3 of this paper by Vehtari and Gelman.

Here's the gist of what I'm doing. Suppose I have 50 subjects (each with many observations) in my data.frame (df) and will hold out 10 for each fold (5-fold CV). For the first fold, I then make a copy (y1) of my response variable with all values set to NA for all observations for the 10 test subjects in fold 1. Now I call INLA using y1 as the response, and then pass the fitted model to inla.posterior.sample to get 1000 samples from the joint hyperparameter-and-latent posterior. Suppose I have N total test observations from my 10 test subjects. I can pull out the N simulated values of the linear predictor for my 10 test subjects from each draw and put them all into a 1000xN matrix -- call it y1_sim (since I am using identity link, these are on the y scale). For each posterior draw we also have a corresponding precision, so I put these into a 1000-vector (tau_sim). Finally, let y1_true be the N-vector of true observations in the test fold.
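In the notation of section 2.3 of that paper, the quantity I am after for one fold is, as I understand it (writing M = 1000 for the number of posterior draws and theta_m for the linear predictor and precision of draw m; this shorthand is mine, not from the paper):

```latex
\mathrm{lpd}_{\text{fold}} \;=\; \sum_{i=1}^{N} \log\!\left( \frac{1}{M} \sum_{m=1}^{M} p\left(y_i \mid \theta_m\right) \right)
```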

Then (I think!) the estimated log predictive density (aka log likelihood) for the fold should result from the final sum below:

log_sum_exp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) } # helper, in case it isn't already defined
residuals <- sweep(y1_sim, 2, y1_true) # subtract the true values from each row of y1_sim
log_dens <- dnorm(residuals, sd=sqrt(1/tau_sim), log=TRUE) # log likelihood of each residual; tau_sim recycles down the 1000 rows, one sd per draw
lpd <- apply(log_dens, 2, function (col) {log_sum_exp(col) - log(length(col))}) # mean across draws for each point, on the log scale to avoid underflow
sum(lpd) # estimated joint log predictive density for this test fold

Of course, all of this depends on my having interpreted the output of functions like inla.posterior.sample correctly. I'm grateful for any pointers if I'm on the wrong track here!

Cheers,
Jonathan





Haakon Bakka

Nov 18, 2016, 5:20:27 AM
to Jonathan Steinhart, R-inla discussion group
From your code, I understand that you compute for each sample (s)
\pi(y_i | predictor_{i,s}, tau_s),
multiply these together over your fold (by conditional independence this gives the joint probability),
and integrate over the posterior (by averaging over the posterior samples).

This looks correct to me.

Also, this way of computing the joint is a very good idea.

The only limitation I see is for "strange data" (data that is "surprising" to the model, the test data where your fit is bad). Here, your sample size may be very small. E.g. if most of the predictive probability comes from, say, 3 samples, and the rest is negligible, then your ESS (effective sample size) is not 1000, but closer to 3. And in this case, the estimated density may be inaccurate.
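This can be made precise with the standard importance-sampling diagnostic (my notation, not from the thread): normalise the per-draw joint fold densities into weights and compute the effective sample size,

```latex
w_m \;=\; \frac{\prod_i \pi(y_i \mid \eta_{i,m}, \tau_m)}
               {\sum_{m'=1}^{M} \prod_i \pi(y_i \mid \eta_{i,m'}, \tau_{m'})},
\qquad
\mathrm{ESS} \;=\; \frac{1}{\sum_{m=1}^{M} w_m^2}
```

With M = 1000 samples, an ESS near 1000 means the average is well supported, while an ESS of a handful signals exactly the "3 samples dominate" situation, where the estimated density may be inaccurate.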

Haakon
