Large-scale posterior prediction


Joseph Hoover

Sep 10, 2018, 1:56:33 PM
to R-inla discussion group
Hello, 

I am using INLA to estimate large hierarchical and sometimes spatial models. These models have N > 500,000 observations nested in around 3,000 hierarchical units. After a lot of small-scale tests, I have found that I am happy with the estimates generated with the Gaussian approximation and EB integration. They are reasonable enough for my purposes, particularly given the computational efficiency of this approach. 

So, with that in mind, I am trying to find the most efficient way to use these models to generate a large number (e.g. > 800,000) of predictions. 

Obviously, the canonical approach to prediction with INLA is to add the predictors and an NA response (DV) to the data matrix; however, this turns an already big computational problem into a really big one. My machine has 36 GB of RAM and I cannot fit all of this in memory. So, to use this approach I've had to estimate the model several times and batch the predictions, which is hugely inefficient. 

Right now, I'm working on functions that will let me easily sample from the posterior distribution (using inla.posterior.sample) and generate posterior predictions for arbitrary linear combinations of my predictors after fitting the model. 

However, I am worried about the size of the sampled matrix (I haven't tried to run it on a full model yet), since inla.posterior.sample seems to automatically sample the linear predictor. For example, for a model with 600,000 observations, 1,000 posterior samples of the linear predictor alone will be quite large. 

Accordingly, I am wondering (1) whether there is any way to suppress calculation of the linear predictor in inla.posterior.sample(). 

More generally, though, I am wondering if there is a more efficient way to solve this problem. And, just to be clear, the problem is: 

What is the most efficient way to generate posterior predictions for a large number of arbitrary combinations of predictors, given a large model with a large N? 

And, again, I do not have sufficient RAM to estimate the model and all predictions simultaneously. 

Also, I would greatly prefer to estimate the full posterior variance for the predictions, so the linear predictor alone isn't really sufficient. 

Anyways, any advice on this would be extremely helpful, so thank you all in advance!

Best,
Joe

Georg Heiler

Sep 10, 2018, 3:12:26 PM
to Joseph Hoover, R-inla discussion group
How much RAM do you need?
How about using a cloud VM with 512 GB or 2 TB of RAM, and destroying the VM after use?

Best,
Georg

--
You received this message because you are subscribed to the Google Groups "R-inla discussion group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to r-inla-discussion...@googlegroups.com.
To post to this group, send email to r-inla-disc...@googlegroups.com.
Visit this group at https://groups.google.com/group/r-inla-discussion-group.
For more options, visit https://groups.google.com/d/optout.

Joseph Hoover

Sep 10, 2018, 3:24:43 PM
to R-inla discussion group
Hi Georg, 

Thanks for the thought. That is definitely an option; however, I am working under something of a constraint, which is that these models are part of an applied-methods paper that prescribes their use for small-area estimation. If at all possible, I am hoping to solve this problem in a way that doesn't require cloud computing, because a largeish segment of the intended audience will probably not be comfortable with that.

But yea, it very well may come to that. 

Thanks!

Best,
Joe

Finn Lindgren

Sep 10, 2018, 5:51:32 PM
to Joseph Hoover, R-inla discussion group
Hi,

I think you’re on the right track with using inla.posterior.sample for this, since that avoids some internal numerical issues in addition to the pure size aspect of doing the predictions inside inla().  I don’t know if sampling from the observation predictor can be avoided; possibly not, since it forms part of the internal model construction. For a purely Gaussian model it could be avoided in theory, but that would require writing special code for such models.  Hopefully Haavard will weigh in with more facts on that.

A workaround might be to not ask for all the samples at once, but rather to ask inla.posterior.sample to produce only one realisation at a time, store only the parts you need from it, and loop until you have all the samples you want.
This needs some care with preallocating the data structure for the bits you want to keep, to avoid unnecessary memory reallocations, but in principle it would let you produce an arbitrary number of samples, provided that _one_ sample can be produced and then (partially) discarded.
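A rough sketch of that loop might look as follows. All object names here are hypothetical: `fit` is assumed to be a model fitted with inla(..., control.compute = list(config = TRUE)), and `A` an n_pred x n_latent matrix selecting the linear combinations of interest.

```r
library(INLA)

n_samples <- 1000
pred <- matrix(NA_real_, nrow = nrow(A), ncol = n_samples)  # preallocate once

for (s in seq_len(n_samples)) {
  smp <- inla.posterior.sample(1, fit)[[1]]  # one realisation at a time
  pred[, s] <- as.vector(A %*% smp$latent)   # keep only the parts needed
  rm(smp)                                    # discard the full sample
}
```

An occasional gc() inside the loop may help keep the memory footprint down, at some speed cost.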

Finn

Joseph Hoover

Sep 10, 2018, 7:43:06 PM
to R-inla discussion group
Hi Finn, 

Thank you for this! I actually arrived at the same conclusion a little bit ago, so it is nice to have the approach confirmed. 

It also turns out that I am not entirely sure I understand how to construct the posterior predictive distribution. If I use parameter samples drawn with inla.posterior.sample() to generate a distribution of predictions based on a linear combination of predictors, this provides an approximation of the posterior of the linear predictor, correct? From there, I need to incorporate the error variance. In this thread, Håvard suggests this can be done as follows: 

"I would use inla.posterior.sample() to generate (iid) 
samples from the posterior and from this sample y_prd"

However, I am not totally sure what he means by "and from this sample y_prd". What I have done is: 

1) Sample from the posterior of the linear predictor using inla.posterior.sample().
2) To predict for a new observation x, use the posterior samples to compute the linear prediction given x's values, which gives a distribution of linear predictions.
3) For each sample i from the distribution of linear predictions, draw samples from a Gaussian distribution with mean equal to linear prediction i and SD equal to the estimated SD of the Gaussian observations. 
3.1) Accordingly, if I have drawn 100 samples from the posterior of the linear predictor and I draw 100 samples for each of them, I wind up with 100 x 100 predictions for the new observation x. As I understand it, this is drawing from the posterior predictive distribution. 
4) To estimate the SD of the posterior predictive distribution, take the SD of all draws from the posterior predictive distribution. 
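A minimal sketch of steps 2-4, assuming `eta` holds the 100 linear-predictor draws for the new observation x and `sigma` the matching 100 draws of the Gaussian observation SD (both extracted from inla.posterior.sample() output; the names are made up):

```r
m <- 100  # Gaussian draws per linear-predictor sample (step 3)
y_pred <- unlist(lapply(seq_along(eta), function(i) {
  rnorm(m, mean = eta[i], sd = sigma[i])
}))
# length(y_pred) is 100 x 100 = 10,000 draws from the
# posterior predictive distribution (step 3.1)
sd(y_pred)  # step 4: SD of the posterior predictive distribution
```

Note that using a per-sample sigma[i], rather than a single point estimate, also propagates the uncertainty in the observation SD into the predictive distribution.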

Does this sound correct? If anyone could verify this (and/or tell me how to do this correctly), I would be again very grateful! 

Thanks!

Best,
Joe


Helpdesk

Sep 11, 2018, 4:08:57 AM
to Joseph Hoover, R-inla discussion group
Hi,

I attach here some ideas; the approach is to add the prediction points later, after the fit. This is discussed in Sec. 7.4. THIS IS A DRAFT OF A FORTHCOMING BOOK... so it is what it is at this stage.



Also, I would look into the option of using the PARDISO sparse library, which can
factorise in parallel, see

inla.pardiso()

for details. You'll need a license, but it's free for academic use.

using

control.inla = list(strategy = 'adaptive')

is a new option that is as quick as 'gaussian' but better. It simply uses
'simplified.laplace' on the fixed effects and small-in-size random effects, and 'gaussian' on the rest. The differences between them are generally largest for those components, so you'll get a lot for almost no extra cost.
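Putting that together might look like the following sketch, where `formula` and `dat` are placeholders, and config = TRUE is what later enables inla.posterior.sample():

```r
library(INLA)

fit <- inla(formula, data = dat,
            control.inla = list(strategy = "adaptive"),     # new adaptive strategy
            control.compute = list(config = TRUE))          # needed for sampling
```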


For inla.posterior.sample(), you can use the option add.names=FALSE to avoid adding names to all samples, which saves memory.

There is no way to sample just a subset; you need to sample everything and then save just what you want.

For each call to inla.posterior.sample(), the precision matrix is factorized, so you want to keep the number of such calls small, and then draw as many samples as you can per call.
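So, for example (a sketch, assuming a fitted object `fit` run with config = TRUE):

```r
## One call of size 1000: the precision matrix is factorized once.
samples <- inla.posterior.sample(1000, fit, add.names = FALSE)
## ...whereas 1000 calls of size 1 would factorize it 1000 times.
```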




Let me know how this goes, and whether anything of the above helps.

Best
H

-- 
Håvard Rue
Helpdesk 

Elias T. Krainski

Sep 11, 2018, 9:50:12 AM
to r-inla-disc...@googlegroups.com

Hi,

How about adding an index argument to inla.posterior.sample() specifying which elements of the latent field should be stored? It would be great to have it passed on to inla.qsample(), reading into memory only the desired part.

Another option would be to accept a 'lincomb' A matrix as an argument, as an option to return Ax, the desired function of the latent field.

best,
Elias

Helpdesk

Sep 11, 2018, 2:58:46 PM
to Elias T. Krainski, r-inla-disc...@googlegroups.com

Thanks.

If the issue is to save CPU time, there is currently not much to
save by this, as the whole vector has to be sampled in any case.

If the issue is to save memory, this is partly possible. If we need a
function ff(sample) and the number of samples, n, is high, then
essentially all the n samples are written to disk. It might be
possible to read them one by one, and then store (within R) only
ff() of each.

H

Subramanian Swaminathan

Sep 14, 2021, 2:45:54 AM
to R-inla discussion group
Hi,

I have a related problem making large-scale posterior predictions after fitting a spatiotemporal model with covariates to a training data set (502 blocks over a period of 72 months). I tried validating the fitted model against a test data set (24 months for all 502 blocks). To do so, I refitted the model to the training data (Jan 2013 to Dec 2018 = 72 months & 502 blocks), setting the response variable to NA for the validation period (Jan 2019 to Dec 2020 = 24 months for all 502 blocks). This took more than 2 weeks to complete on my workstation with 128 GB of RAM (Xeon processor). Is there a way to reduce the time needed to validate the model without refitting it to the training data, as my dataset is very large? And, just to be clear, the problem is: 

(1) Can I use the model fitted to the training data to make predictions for the validation period without refitting the model to the training data again? I would greatly appreciate your help in this respect.
(2) I also understand from your response below that predictions can be done without refitting the model, as shown in the URL (https://www.dropbox.com/l/scl/AACju7yt3VLR5CYR3OZLK6FaTvv-ktBK1PE). Unfortunately, I found that the book at that URL is no longer available. Would it be possible to give me access to it?

Thanks in advance and regards
S. Subramanian

Helpdesk

Sep 14, 2021, 3:52:05 AM
to Subramanian Swaminathan, R-inla discussion group
an example is attached.
he...@r-inla.org
sub.R

Helpdesk

Sep 14, 2021, 3:54:32 AM
to Subramanian Swaminathan, R-inla discussion group

make sure to use PARDISO and set

inla.setOption(inla.mode="experimental")

(you need R-4.1 and the newest testing version)




Subramanian Swaminathan

Sep 14, 2021, 8:12:20 AM
to R-inla discussion group
Hi,

Many thanks for your quick response. I tried to run the code 'sub.R' and found that it is indeed really helpful. However, I am not clear about repeating the 'r.block' 'k' times. Is that for getting 'k' repeats of 'y.pred[ii]' for the out-of-sample period? If so, how are the final predicted values calculated from the 'k' repeats for the out-of-sample period? Would you be kind enough to clarify this? This would be very helpful for resolving our predictions for the out-of-sample period.

Thanks once again for the great help!
S. Subramanian

Subramanian Swaminathan

Sep 15, 2021, 8:06:04 AM
to R-inla discussion group
Hi,

I tried installing the academic version of PARDISO, but I understand that it is available only for macOS and Linux. I wonder whether PARDISO is also available for Windows 10?

S. Subramanian


Helpdesk

Sep 22, 2021, 2:07:37 PM
to Subramanian Swaminathan, R-inla discussion group
I'll try to make this happen.