predict() for large datasets with inlabru


Virgilio Gómez Rubio

Feb 27, 2024, 6:05:58 AM
to R-inla discussion group
Hi,

I am using inlabru to make predictions on a very large dataset (millions of records), using the generate() function to process the output. As the dataset is very large, I make the predictions by splitting the dataset into smaller chunks that I can process separately. I know that the 'seed' argument can be used to obtain the same sample of the model parameters, but I have not found a way to draw the sample from the posterior once and reuse it to save time. I am aware of inla.posterior.sample(), but I would like to keep the whole pipeline within the inlabru package so that I can rely on the definition of the inlabru components.

Best,

Virgilio

Finn Lindgren

Feb 27, 2024, 6:11:06 AM
to Virgilio Gómez Rubio, R-inla discussion group
Hi,

It’s not clear exactly what you’re asking; the inlabru generate() function internally calls inla.posterior.sample(), and either returns the samples of the model parameters and latent variables, or returns samples of a computed prediction expression.

Since you mention “seed” I’m guessing the issue is to get repeatable samples? To the extent that this is possible with inla.posterior.sample, you must set _both_ the R seed with set.seed(), _and_ the seed argument, as inla uses both R random numbers and its own internal generator.
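
For example, a repeatable pair of calls could look something like this (just a sketch; "fit" and "df" stand in for your fitted bru object and prediction data, and "field" for one of your components):

set.seed(1)  # R's own RNG, used internally by inla.posterior.sample()
s1 <- generate(fit, newdata = df, formula = ~ field, n.samples = 100, seed = 1)
set.seed(1)  # reset before the second call
s2 <- generate(fit, newdata = df, formula = ~ field, n.samples = 100, seed = 1)
# s1 and s2 should now contain identical samples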

Finn


Virgilio Gómez Rubio

Feb 27, 2024, 11:01:24 AM
to Finn Lindgren, R-inla discussion group
Hi Finn,

Thanks for your answer. 

> It’s not clear exactly what you’re asking; the inlabru generate() function internally calls inla.posterior.sample(), and either returns the samples of the model parameters and latent variables, or returns samples of a computed prediction expression.

What I mean is whether I can keep the samples of the model parameters and latent variables and then evaluate them on my data chunks separately using inlabru's components. This way I would draw the samples of the model parameters and latent variables once and then reuse them for all chunks. In pure INLA this would be something like calling inla.posterior.sample() and then passing the sample to inla.posterior.sample.eval() to obtain the estimates for all my data chunks. However, I have my model defined using inlabru's components.
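
In pure INLA that would be roughly this (a sketch; "res" stands for an inla() fit run with control.compute = list(config = TRUE)):

# draw the joint posterior samples once
samples <- inla.posterior.sample(100, res)
# evaluate the same stored samples, here for the linear predictor
ests <- inla.posterior.sample.eval(function(...) Predictor, samples)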

> Since you mention “seed” I’m guessing the issue is to get repeatable samples? To the extent that this is possible with inla.posterior.sample, you must set _both_ the R seed with set.seed(), _and_ the seed argument, as inla uses both R random numbers and its own internal generator.

Thanks for clarifying this as well. 

Best,

Virgilio

Finn Lindgren

Feb 27, 2024, 11:52:45 AM
to R-inla discussion group
Hi Virgilio,

I think I understand; you want to generate and store samples of the
parameters and latent vectors _without_ computing the component
effects, and then _later_ compute the component effects and predictor
expressions?
Yes, if you look at the code for generate.bru(), you'll see that it
first generates the samples, and then only if a predictor formula has
been provided, calls evaluate_model() with "newdata" to evaluate the
component effects, and compute the predictor expression. You could do
the same at a later stage by _not_ providing a formula to generate,
and then later directly calling evaluate_model() in the same way as
generate.bru() does, using the generate() output as "state".

For completeness: the bru predict() method simply calls generate(),
and then a method that computes summary statistics. I plan to make
that summarisation function a supported exported method as well, but
currently that's a more internal part of the predict() function
itself.

"Chunking" the generate() behaviour is something I've wanted to do for
a long time, including using recursive quantile estimation methods in
predict() to allow a smaller memory footprint when using a large
number of samples. Splitting the parameter/component generation from
the effect & predictor computation is something I hadn't really
considered, but as said above it should simply work, by doing exactly
what generate() does when you do supply a formula and "newdata".
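
Roughly, the split could look like this (a sketch; "fit" and "chunk1"
are hypothetical stand-ins for your fitted bru object and a data
chunk, and the predictor formula must name your own components):

# 1. Draw and store the parameter/latent samples once, with no formula:
state <- generate(fit, n.samples = 100)
# 2. Later, for each data chunk, evaluate the component effects and
#    the predictor expression, reusing the stored samples as "state":
pred1 <- evaluate_model(
  model = fit$bru_info$model,
  state = state,
  data = chunk1,
  predictor = ~ mycomponent
)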

Finn
--
Finn Lindgren
email: finn.l...@gmail.com

Virgilio Gómez-Rubio

Feb 27, 2024, 12:36:02 PM
to Finn Lindgren, R-inla discussion group
Hi Finn,

Many thanks for your answer. Yes, that is exactly what I wanted to do! I have tried evaluate_model() but I am uncertain about what ‘input’ should be. May I ask what I should put there?

I think that chunking the prediction will be good to be able to process data in parallel. In particular, I am using it right now to obtain estimates at ~2500 areas using individual/household data, which is a large dataset.

Best,

Virgilio

Finn Lindgren

Feb 27, 2024, 12:39:29 PM
to Virgilio Gómez-Rubio, R-inla discussion group
I'll need to trace the "input" information through the generate() code to be sure; it might be optionally precomputed information, since I think it involves information from newdata, but I'm not sure where it's computed without looking at the code; I'll check tonight or tomorrow.
Finn

Finn Lindgren

Feb 28, 2024, 9:19:22 AM
to Virgilio Gómez-Rubio, R-inla discussion group
I've checked the code now.
generate.bru calls evaluate_model with

vals <- evaluate_model(
  model = object$bru_info$model,
  state = state,
  data = newdata,
  predictor = formula,
  used = used
)

which doesn't use the "input" argument. The "input" argument to
evaluate_model() may be used elsewhere in the code, where
evaluate_model() is called repeatedly for individual state vectors, to
avoid having to recompute the "input" values when the data/newdata
remains unchanged. In your use case you will supply different data
each time, so you wouldn't benefit from having that precomputed.

Finn

Virgilio Gómez-Rubio

Mar 1, 2024, 11:14:10 AM
to Finn Lindgren, R-inla discussion group
Hi Finn,

Many thanks for this. I have been able to split the prediction process using generate() and evaluate_model(), and it works like a charm! I save about 40 seconds (out of ~140 seconds originally) when processing each of the ~120 chunks. And, most importantly, I am using exactly the same sample from the posterior for all the predictions.

Best,

Virgilio

Finn Lindgren

Mar 1, 2024, 11:17:53 AM
to Virgilio Gómez-Rubio, R-inla discussion group
Great!
Also note that the inlabru generate()/predict() code allows some
"unorthodox" features, such as returning/computing a data.frame or
list of multiple different predictors, e.g.

formula = ~ {list(A = component1, B = component2, C = component1 + component2)}

would compute the predictions both for the individual components and
for their sum.
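
For example (a sketch; "fit" and "df" are hypothetical stand-ins for
a fitted bru object and prediction data, and component1/component2
for your own component names):

preds <- generate(
  fit, newdata = df,
  formula = ~ {list(A = component1, B = component2, C = component1 + component2)},
  n.samples = 10
)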

Finn