grouped rw2, cyclic and scaled?


Scott Burman

Aug 31, 2021, 6:14:37 PM
to R-inla discussion group
Hello,

I've been finalizing my models and in particular, I have taken advantage of grouping. 

My model is multi-level: I have a number of groups of sensors, each containing five individual sensors. I used a model matrix via the "z" model type (a la Gomez-Rubio). I then have an IID term for which part of the stream a sensor is in (main or side), and within this a grouped IID term for whether the sensor is upstream or downstream. Finally, I have a temporal component, day of sampling, with hour of day as the grouped term. Both are rw2. The time of day is set as cyclic, which requires turning off "scale.model".

In essence, it appears as follows (variables are in all caps):

formula <- sensor_reading ~ 0 +
  f(SENSOR_NAME, model = "iid", hyper = hypiid) +
  f(SUB_SENSOR_NUM, model = "z", Z = zSEN_NUM, hyper = hypiid) +
  f(HORIZONTAL_LOCATION, model = "iid", hyper = hypiid,
    group = UP_DOWNSTREAM,
    control.group = list(model = "iid", hyper = hypiid, scale.model = TRUE)) +
  f(SAMPLING_DAY, model = "rw2", cyclic = FALSE, scale.model = TRUE,
    values = seq(0, last_day), hyper = hyperprec, diagonal = 1e-4,
    constr = FALSE, group = HOUR_OF_DAY,
    control.group = list(model = "rw2", cyclic = TRUE, hyper = hyperprec,
                         scale.model = FALSE, adjust.for.con.comp = TRUE))

I guess I'm looking for some insight into the ramifications of not scaling the latent effects, and to ask whether there is any plan to allow grouped effects to have scaled latent effects and be cyclic at the same time.

My model outputs look excellent: the data fits the linear predictions exceptionally well (a very linear diagonal with slope = 1), and the residuals follow a horizontal line at 0. My DIC is surprisingly low, and my CPO and PIT both look good. Yet when I generated a validation set by leaving out 1/5th of my data (one randomly selected sensor per sensor group), setting those readings to NA and letting INLA predict them, only about 1/5th of this validation data falls within the 95% CI. I fear overfitting. Could this be related to not scaling the hour-of-day term?

I have run about 20-30 different variations of this model, and grouping hour within sampling day provides considerably better fits than an interaction effect, or treating them as separate effects. I am wondering if this excellent fit is the result of my not scaling that effect.

Any insight would be helpful.

Thanks!

Finn Lindgren

Aug 31, 2021, 6:46:15 PM
to Scott Burman, R-inla discussion group
Hi,

you may have already considered this, but I have to ask since it's such a common conceptual mistake:
The linear predictor output (and the link-function transformed version in fitted.values) gives credible intervals for the linear predictor for "known" observations, and prediction intervals for the linear predictor for the "missing" observations (both are intervals for the linear predictor), and _not_ for the observation model itself. For that, you need to take the observation distribution into account as well. The easiest general way to do that is to generate samples of the predictor with inla.posterior.sample, and then sample the observation model conditionally on that.

In other words, if m is the observation expectation conditional on the latent variables (i.e. the predictor), then
  E(Y|data) = E(E(Y|m,data)) = E(m|data)
and
  Var(Y|data) = E(Var(Y|m,data)) + Var(E(Y|m,data)) = E(Var(Y|m,data)) + Var(m|data)
The fitted.values output gives you E(m|data) and Var(m|data), but not E(Var(Y|m,data)).
That last term can sometimes be obtained from the hyperparameter summary, but if you want more accurate prediction intervals for Y you need more (that you can get via sampling).
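A quick numerical illustration of the decomposition, in a toy setting where the predictor m and a Gaussian observation model are simulated directly (the numbers are made up and nothing here is INLA-specific):

```r
## Toy check of Var(Y) = E(Var(Y|m)) + Var(m):
## predictor m ~ N(0, 1), observation Y | m ~ N(m, sd = 2).
set.seed(1)
m <- rnorm(1e5, mean = 0, sd = 1)
y <- rnorm(1e5, mean = m, sd = 2)

var(y)        # total predictive variance, close to 1 + 4 = 5
var(m) + 2^2  # Var(m) + E(Var(Y|m)), the two pieces separately
```

So intervals built from Var(m|data) alone (the fitted.values sd) will be too narrow whenever the observation noise is non-negligible.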

Finn



--
Finn Lindgren
email: finn.l...@gmail.com

Scott Burman

Aug 31, 2021, 6:57:33 PM
to R-inla discussion group
You are right that I had not considered that. I'm attempting to finish my dissertation, so I'll admit -- I haven't slept much recently and am struggling to wrap my head around how best to go about doing this. Do you have any texts or examples that give some insight into how to sample the observation model conditionally on the posterior samples?

Thanks!

Finn Lindgren

Sep 1, 2021, 8:38:18 AM
to R-inla discussion group
Hi,
only have time for a brief answer now:
First, you need the option control.compute=list(config=TRUE) when running your inla() call.
Then, call inla.posterior.sample and take a look at the output. It will have various elements from your model containing samples from the posterior.
Extract the predictor information that gives you the parameter(s) of your observation model, and then call the corresponding "rsomething" function (rnorm for Gaussian observations) to generate a sample of that.
Do that for each sample in the inla.posterior.sample output. The inla.posterior.sample.eval function can also help with this.
There have been a few threads on the list recently with some examples, I think.
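For a Gaussian likelihood, the steps above might look roughly like this (a sketch, not tested on your model: 'fit' is the inla() result run with control.compute = list(config = TRUE), 'idx' indexes the held-out observations, and the row/name conventions follow the standard inla.posterior.sample output):

```r
samp <- inla.posterior.sample(1000, fit)
ypred <- sapply(samp, function(s) {
  ## linear predictor draws for the held-out points
  eta <- s$latent[paste0("Predictor:", idx), 1]
  ## observation-level noise from the sampled precision
  tau <- s$hyperpar["Precision for the Gaussian observations"]
  rnorm(length(eta), mean = eta, sd = 1 / sqrt(tau))
})
## 95% prediction interval per held-out point
pi <- apply(ypred, 1, quantile, probs = c(0.025, 0.975))
```

Coverage of the validation set can then be checked against these intervals rather than against the fitted.values intervals.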

Finn

Scott Burman

Sep 1, 2021, 6:29:34 PM
to R-inla discussion group
This is extremely helpful, thank you. I will play with this tonight.

Scott Burman

Sep 1, 2021, 6:36:17 PM
to R-inla discussion group
Finn,

I've not yet gotten models to work with control.compute=list(config=TRUE); I've had to turn it off or I get a segfault. I don't know whether I have tried it with the latest version, but I will tonight.

Thanks
Scott

Helpdesk

Sep 2, 2021, 12:44:08 AM
to Scott Burman, R-inla discussion group

The segfault is most likely due to lack of memory. The inla.mode="experimental" option will help (R-4.1 and the latest testing version), as will a (Linux) machine with more RAM. Please enable the PARDISO library. You may also test adding control.inla=list(int.strategy="eb") to use only one integration point, at least to be able to move forward and check things out.



Håvard Rue
he...@r-inla.org

Scott Burman

Sep 2, 2021, 12:48:04 AM
to Helpdesk, R-inla discussion group
It's a compute cluster with 256 GB RAM per node. I'll try again and maybe drop the thread count a bit.

I'm already using the latest experimental mode. I'm not sure if I've tried config=TRUE since moving to the nightly build and using experimental.

Thanks! 
Scott

Scott Burman

Sep 2, 2021, 5:01:37 AM
to R-inla discussion group
How long should inla.cpo take relative to the original model fit? My models contain about 200,000 data points. Fitting the models often takes about 20-40 minutes, but running inla.cpo(model, force=TRUE) is taking many hours; it is currently close to seven.

Thanks
Scott

Helpdesk

Sep 2, 2021, 5:39:24 AM
to Scott Burman, R-inla discussion group

Normally it would not be needed to call 'inla.cpo', as the cpo values are generally quite good (even with 'failure=1'; 'failure' was a badly chosen word at the time...).

inla.cpo is like a pedantic option, which recomputes *each cpo value* with failure=1, manually, one at a time, which can simply be way too much.


Scott Burman

Sep 2, 2021, 5:53:39 AM
to R-inla discussion group
So instead, I've changed my code to do the following:

if (sum(model$cpo$failure, na.rm = TRUE) == 0) {
  model$cpo$failure[!is.na(model$cpo$cpo) & model$cpo$cpo < 1e-20] <- 1
  model <- inla.cpo(model)
}

Is this a reasonable approach -- only rerunning inla.cpo for the very smallest CPO values?

Thanks
Scott

Helpdesk

Sep 2, 2021, 6:09:20 AM
to Scott Burman, R-inla discussion group

The question is more if you really need to call inla.cpo, even if
'failure=1'


Scott Burman

Sep 2, 2021, 6:13:02 AM
to R-inla discussion group
Interesting, thank you. I appreciate your time and prompt responses.

Scott Burman

Sep 15, 2021, 7:22:26 PM
to R-inla discussion group
I've gotten models to run with config=TRUE, but when running inla.posterior.sample, I get an error that one (or more) cores are not returning results, which may affect the output.

Can  inla.posterior.sample be run in parallel, or does it need to be run with just one thread?
My models are quite large, and generating samples with n=1 took considerable time. 

I have done a bit of searching in the forum and did not find many examples of out-of-sample validation. Without having seen the output of inla.posterior.sample, it's difficult to conceptualize the next steps once I have the posterior samples.

Thanks

Scott Burman

Sep 15, 2021, 7:56:55 PM
to R-inla discussion group
I apologize for the second quick reply, but I am struggling to understand the distinction between this approach and the example given in chapter 12 of Gomez-Rubio, which fills in missing responses using model$summary.fitted.values (https://becarioprecario.bitbucket.io/inla-gitbook/ch-missing.html#ch:missing).
I know I'm missing something.

If I'm understanding correctly, Gomez-Rubio samples the observational distribution, parameterized by the values from summary.fitted.values for the missing values. These would then define a distribution, and using something like qnorm(c(0.025, 0.5, 0.975), mean=summary.fitted.values$mean, sd=summary.fitted.values$sd), I would evaluate whether the observed value is within these quantiles?

I guess I'm not understanding the need for inla.posterior.sample. I apologize in advance for the obtuse question, and appreciate your time.

Thanks

Helpdesk

Sep 16, 2021, 4:16:18 AM
to Scott Burman, R-inla discussion group

The fitted values are often, but not always, what is used to predict missing data, like the 'mean' and the associated 'sd'. It's all about having information about the linear predictor and trying to use that to predict a missing data point. You can do that from the linear.predictor or fitted.values output, or by using inla.posterior.sample().

The sampling should work; if it does not, make sure to try with R-4.1 and a new testing version. If memory is a concern, then try adding

inla(..., inla.mode = "experimental")
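As a cheaper alternative to full sampling, for a Gaussian likelihood one can approximate prediction intervals by widening the fitted-value sd with the observation noise, following the variance decomposition earlier in this thread (a sketch under the assumption that the posterior of the predictor is roughly Gaussian; 'fit', 'idx' and 'y_obs' are placeholders):

```r
m   <- fit$summary.fitted.values$mean[idx]
s   <- fit$summary.fitted.values$sd[idx]
tau <- fit$summary.hyperpar["Precision for the Gaussian observations", "mean"]

## prediction sd = sqrt(Var(m|data) + E(Var(Y|m,data)))
sd_pred <- sqrt(s^2 + 1 / tau)
lower <- qnorm(0.025, mean = m, sd = sd_pred)
upper <- qnorm(0.975, mean = m, sd = sd_pred)
mean(y_obs[idx] >= lower & y_obs[idx] <= upper)  # empirical coverage
```

This plugs in the posterior mean of the precision rather than integrating over it, so the full sampling route remains the more accurate option.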


