Spatially varying logistic regression with large data - INLA is crashing


Boris

Aug 6, 2018, 5:52:58 AM
to R-inla discussion group
Dear INLA enthusiasts,

I am struggling with the implementation of a spatially varying logistic regression for very large data.

The research question:

I want to explore which population groups in which locations are most at risk for a specific disease. As I have information at the individual level, I want to use logistic regression. However, I need to account for regional differences, as I have both very large urban areas and sparsely populated rural areas, so you may expect that e.g. migration background and availability of healthcare matter more in urban areas than in rural areas. That's why the model of choice is ultimately a spatially varying logistic regression.

The data:

The dataset consists of 1.6 million anonymized persons distributed across 1,600 municipalities. For each person I have individual information (e.g. age, gender, employment status, migration background, previous diseases and so on) and several variables at the municipality level (e.g. one-person households, availability of healthcare). The outcome variable, the disease in question, is binary (1 = has the disease, 0 = does not have the disease). The variables gender, employment status and migration background are also coded as binary (0/1). Age, number of previous diseases and percentage of one-person households are continuous variables. I ensured that the data is formatted the way INLA needs it (e.g. the area IDs in the data match the IDs in the graph file).

The approach:

First I created a non-spatial logistic regression, which does its job as expected. Then I proceeded to the Bayesian SVC model using the following formula:


model_svc <- inla(DISEASE ~ 1 +
                    f(idx1, female, constr = TRUE, model = "besag",
                      graph = "graph_v3") +
                    f(idx2, nr_of_previous_diseases, constr = TRUE, model = "besag",
                      graph = "graph_v3") +
                    f(idx3, unemployed, constr = TRUE, model = "besag",
                      graph = "graph_v3") +
                    f(idx4, migration_background, constr = TRUE, model = "besag",
                      graph = "graph_v3") +
                    f(idx5, one_person_households, constr = TRUE, model = "besag",
                      graph = "graph_v3") +
                    f(idx6, GPs_available, constr = TRUE, model = "besag",
                      graph = "graph_v3") +
                    f(idx7, age, constr = TRUE, model = "besag",
                      graph = "graph_v3"),
                  family = "binomial", data = data, control.compute = list(dic = TRUE),
                  control.inla = list(diagonal = 1000, strategy = "gaussian", int.strategy = "eb"))

If I run the model with a small subset of 20,000 persons, it works, but if I run the model with a larger sample of 50,000 persons, INLA crashes. Ultimately, I need to run the analysis with all 1.6 million persons. Based on some suggestions, I included control.inla = list(diagonal = 1000, strategy = "gaussian", int.strategy = "eb"). This, however, only alters the processing time for a small sample; INLA still crashes for a larger sample, regardless of whether I use the diagonal argument or not. I will admit that I don't yet know how to calculate starting values or determine reasonable precision parameters, if that plays a role for this problem. I have a server with 32 CPUs and 256 GB memory, so my available hardware is not the issue. Unfortunately, I cannot provide sample data as restrictions apply.

What should I do to make the model run for all 1.6 million persons?

Any help in this regard is greatly appreciated.

Best, Boris


Bob O'Hara

Aug 6, 2018, 6:01:05 AM
to Boris, R-inla discussion group
I've run big analyses like this, and had crashes. The simple, if
unhelpful, answer is to get more memory. Another solution might be to
ignore the spatial effect and fit the random slopes as iid (or even as
fixed effects, as interactions with region). If this works, it should
help you simplify the model by showing you which effects do not vary
spatially, and hopefully that will remove a few spatial effects from
the final model.

Bob



--
Bob O'Hara
Institutt for matematiske fag
NTNU
7491 Trondheim
Norway

Mobile: +47 915 54 416
Journal of Negative Results - EEB: www.jnr-eeb.org

Haakon Bakka

Aug 6, 2018, 6:27:25 AM
to Bob O'Hara, Boris, R-inla discussion group
Hi,

Can you attach the output of verbose=T in a text file?

Sometimes it is lack of memory, but 50,000 rows is not a very large dataset (maybe "medium large"). The size of your model (which we can see from the verbose output) is more important.

From your formula I see that you have several spatial effects. If these are not identifiable from your data, that can cause an inconsistency that can cause a crash.

I strongly suggest adding PC priors! These also make computations more stable.

Kind regards,
Haakon Bakka





Boris

Aug 6, 2018, 8:04:52 AM
to R-inla discussion group
Many thanks for your replies!

I have attached the output of verbose = T. Theoretically, I should have enough memory with 256 GB RAM. I have also tried to run a spatially varying regression with only one variable for which I know a spatial effect should exist, but INLA still crashes for larger data. So the issue seems to be somewhere else.

As I am not too familiar with PC priors, could you suggest how to add this to my formula? I have seen the description hyper = list(<theta> = list(prior="pc.prec", param=c(<u>,<alpha>))), but I do not know how to specify reasonable values for u and alpha, as I went mostly with the default settings.

@Bob O'Hara: I have tried it also with an "iid" prior instead of a "besag" prior, but INLA still crashes, unfortunately.



[Attachment: verbose.txt]

Haakon Bakka

Aug 6, 2018, 8:18:24 AM
to Boris, R-inla discussion group
To me, it looks like the crash happens before the algorithm has completed the initialisation and setup step.

I would double-check your model setup. You can also try to run it on some simple Gaussian simulated data. At least check every component of your formula with separate inla calls, with real or simulated data.

To check the memory you could look at the processor information (e.g. top) to see how far it goes before it crashes.

Your model and data are complex enough that there are a lot of things that might be coded wrong. Try to find the simplest possible version of your code that still crashes, and let us know.

Make sure every variable is a simple numeric variable.
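A minimal sketch of such a component-by-component check on simulated Gaussian data might look like this (the index and graph names simply follow the formula posted earlier; everything else is made up for illustration):

```r
library(INLA)

## Simulated toy data over the same 1,600 municipalities
## ("graph_v3" is assumed to be the municipality neighbourhood graph)
n.area <- 1600
n.obs  <- 20000
d <- data.frame(idx1 = sample(1:n.area, n.obs, replace = TRUE),
                x    = rnorm(n.obs))
d$y <- 1 + 0.5 * d$x + rnorm(n.obs)

## Fit a single spatially varying coefficient term in isolation
fit <- inla(y ~ 1 + f(idx1, x, model = "besag", graph = "graph_v3",
                      constr = TRUE, scale.model = TRUE),
            family = "gaussian", data = d, verbose = TRUE)
summary(fit)
```

If this simple version runs but the full model does not, add terms back one at a time until it breaks.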

Haakon

Haakon Bakka

Aug 6, 2018, 8:19:47 AM
to Boris, R-inla discussion group
For the PC prior, as a first try you can use
scale.model = T
and
hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))
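Plugged into one term of the formula from the first post, this would look something like:

```r
f(idx1, female, constr = TRUE, model = "besag", graph = "graph_v3",
  scale.model = TRUE,
  hyper = list(theta = list(prior = "pc.prec", param = c(1, 0.01))))
```

Here param = c(1, 0.01) encodes the prior statement P(sd > 1) = 0.01 on the standard deviation of the spatial effect.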

Haakon



Helpdesk

Aug 6, 2018, 8:52:52 AM
to Boris, R-inla discussion group
I can also try to run it here to check. In that case, I'll need code to
reproduce it; alternatively, the directory 'inla.model' created with the
option keep=TRUE.
Håvard Rue
Helpdesk
he...@r-inla.org

Boris

Aug 6, 2018, 11:35:00 AM
to R-inla discussion group

Many thanks for your replies.

I followed the suggestion to include the PC prior with the parameters suggested. This allowed me to increase the sample to 160,000 persons - so indeed an improvement (though still far from the 1.6 million I need to run the analysis for). However, when I try to run the model with more than that, it still crashes. I also ensured that each variable is numeric. The largest possible subset of 160,000 rows consumes about 20 GB of RAM, so I still have plenty available from my 256 GB in total.

I have tried different combinations for several hours now, e.g. fewer variables, with and without the iid prior, binary explanatory variables coded as numeric or as factors, etc., but for a large number of rows the model still crashes quite early on. The last thing INLA tries to do before it crashes is the Gaussian approximation.

Is there anything else I could try to make it work?

@Dr. Rue, many thanks for the offer - the code I ultimately used is below. This one worked for 160,000 rows. Unfortunately, I cannot share the data itself, as privacy protection restricts its distribution - is there any other way for you to get a closer look?

model_svc <- inla(DISEASE ~ 1 +
  f(idx1, female, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx2, age, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx3, unemployed, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx4, migration_background, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx5, p_one_person_households, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx6, nr_of_previous_diseases_Er, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx7, GPs_available, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx8, model = "iid", hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))),
  family = "binomial", data = daten1, control.compute = list(dic = TRUE),
  control.inla = list(diagonal = 10000, strategy = "gaussian", int.strategy = "eb"), verbose = TRUE)

Many thanks in advance,

Boris

jpkr...@gmail.com

Aug 7, 2018, 1:51:06 AM
to R-inla discussion group
Hi Boris,

I don't think I can help you on the technical side of making this model run, but something does strike me about the model specification in light of the data description you gave.

You said that you have individual-level data (e.g. gender) and municipality-level data (e.g. one-person households). Yet in your model you are modelling both gender and one-person households as a besag model with graph_v3 as the neighbourhood graph. To me this raises two questions: 1) How are you generating this graph_v3? 2) Why are you applying the same graph geometry to individual-level data and municipality-level data? In my experience (admittedly limited), the besag model is applied at the municipality level, since it represents the spatially structured effects by areas, so I don't understand why you are applying it to gender. Also, specifying gender as a spatially structured variable in the first place implies that you believe the effect of gender on disease risk varies by location. Do you have a good a priori reason to believe this? Without a disease-specific prior belief, I would find this unusual. Similar comment for age and number of previous diseases - is there a good reason to believe there are spatially structured effects from such variables?

I have a suspicion that what you actually want to do is model "fixed effects" for the individual-level variables and spatially structured effects for the municipality-level variables. In that case your model would look simpler - something like:

model_svc <- inla(DISEASE ~ 1 + female + age + unemployed + migration_background + nr_of_previous_diseases_Er +
  f(idx5, p_one_person_households, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx7, GPs_available, constr = TRUE, model = "besag",
    graph = "graph_v3", scale.model = TRUE, hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))) +
  f(idx8, model = "iid", hyper = list(theta = list(prior="pc.prec", param=c(1,0.01)))),
  family = "binomial", data = daten1, control.compute = list(dic = TRUE),
  control.inla = list(diagonal = 10000, strategy = "gaussian", int.strategy = "eb"), verbose = TRUE)


Perhaps such a model will be easier to fit on 1.6 million people? (Note I am assuming graph_v3 depicts the neighbourhood graph of the municipalities.)

James

Helpdesk

Aug 7, 2018, 2:51:20 AM
to jpkr...@gmail.com, R-inla discussion group
If you're using an updated R-INLA version, then also check out the new
sparse engine,

inla.pardiso()

which, if you run in pardiso.parallel mode, should be able to go high
in dimension...

try using

control.inla = list(int.strategy = "eb")

as well, while testing.
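For reference, switching on the PARDISO engine looks roughly like this (a sketch; the license path is a placeholder, and inla.pardiso() itself prints the up-to-date instructions):

```r
library(INLA)
inla.pardiso()                                     # prints instructions for obtaining a license
inla.setOption(pardiso.license = "~/pardiso.lic")  # placeholder path to the license file
inla.setOption(smtp = "pardiso")                   # select the PARDISO sparse-matrix engine
inla.pardiso.check()                               # verify that PARDISO is working
```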

Boris

Aug 7, 2018, 7:02:22 AM
to R-inla discussion group

Dear all,

many thanks for your replies! James, I have tried to run a global logistic regression model with an iid and a besag prior as well - the data seems generally too large to do so for some reason. But I can now pinpoint the crash better: it happens directly before the optimization step. To answer your question, the reason I work at the municipality level is that this is the smallest unit at which I can analyse individual-level data without violating the privacy protection of the individuals.

Dear Havard,

many thanks for the suggestion with inla.pardiso. I have tried and signed up for it, but it is currently not available for Windows, as stated in the message that pops up in R (sadly). I would have loved to try this option out. I will contact Olaf Schenk and ask whether there might be any workaround for this one. I am working on a Windows server, as my company does not have Linux or Apple servers.

I will have to see what else I can do to make the analysis work. The INLA package is really sophisticated; I just wish it would work with larger data as well....


Boris

Aug 7, 2018, 7:17:41 AM
to R-inla discussion group

These are the last lines before it crashes, if that is of any help. It seems to crash while indexing the table:

		prior mean=[0]
		prior precision=[0.001]
		compute=[1]
		output:
			summary=[1]
			return.marginals=[1]
			nquantiles=[3]  [ 0.025 0.5 0.975 ]
			ncdf=[0]  [ ]
	Index table: number of entries[11], total length[564811]
		tag                            start-index     leError in inla.inlaprogram.has.crashed() : 
  The inla-program exited with an error. Unless you interupted it yourself, please rerun with verbose=TRUE and check the output carefully.
  If this does not help, please contact the developers at <he...@r-inla.org>.
In addition: Warning message:
running command '"C:/Users/Documents/R/win-library/3.4/INLA/bin/windows/64bit/inla.exe"  -b -s -v  "C:\Users\AppData\Local\Temp\2\RtmpoPnlHs\fileb7284ca0527b/Model.ini"' had status 253

Helpdesk

Aug 7, 2018, 7:47:58 AM
to Boris, R-inla discussion group
On Tue, 2018-08-07 at 04:02 -0700, 'Boris' via R-inla discussion group
wrote:
>
> many thanks for the suggestion with inla.pardiso. I have tried and
> signed up for it but it is currently not available for windows as
> stated in the message that pops up in R (sadly). I would have loved
> to try this option out. I will contact Olaf Schenk and ask whether
> there might be any workaround for this one. I am working on a windows
> server, so my company does not have Linux or Apple server.


I have the Windows version of the library, but with another compiler,
and I haven't made it work yet...

you can run a Linux virtual machine in your Windows box though...

jpkr...@gmail.com

Aug 8, 2018, 4:47:08 AM
to R-inla discussion group
Ok thanks Boris.

What I'm still wondering is how big your graph is. If you do nrow(graph_v3), does it say 1,600 or does it say 1.6 million?

James

Boris

Aug 9, 2018, 6:51:32 AM
to R-inla discussion group

Dear all,

I have managed to narrow the problem down further: it has something to do with INLA running on a server.

When I run a large (1.6 million rows) model on my virtual machine, it works perfectly fine. The same code, however, causes INLA to crash with large data on my server. On my Windows server, I have installed R 3.5.1 and the latest INLA version. Is there anything I should take particularly into account when running R / RStudio / INLA from a server, such as installation routines, required packages, or the installation location? I have also tried to run it based on R 3.4.1 (which I have on my virtual machine and which works), but this did not resolve it either. I have it installed on drive C where all programs are.

Now that I have narrowed the problem down to the server installation being the issue, I hope to be able to resolve it somehow.... Any idea here is welcome.

@James, the graph is 1,600 rows long (the number of municipalities). Since it works on my VM, I can rule out any data-specific issues.

Many thanks in advance.

Boris

Helpdesk

Aug 9, 2018, 6:56:13 AM
to Boris, R-inla discussion group
And your virtual machine runs Linux? The server runs Windows?

Boris

Aug 9, 2018, 8:09:31 AM
to R-inla discussion group

Both the virtual machine and the server run on Windows. When installing INLA I get the error that "Rgraphviz" is unavailable. Similarly, "lib is unspecified" - packages are installed in D:\documents, if that helps. INLA works, though; it just can't handle the same amount of data as the virtual machine...

Boris

Aug 20, 2018, 6:37:43 AM
to R-inla discussion group

Dear all,

I managed to make it work. The reason why such a large model worked on my virtual machine but not on the server was rather simple, but it took me a few weeks to figure out: on the virtual machine, the table is automatically reordered using "amdc". On the server, INLA did not choose "amdc" on its own and thus crashed for anything above 200,000 rows. Simply adding control.inla = list(..., reordering = "amdc") solved the problem, and all the models I wanted to evaluate ran like a charm on the server, finishing in 1-2 days. May this info be of use for someone else with a similar problem someday...

If someone wants to add a few sentences what "amdc" actually does, feel free to enlighten me.
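For reference, the fix reads roughly like this in the call (a sketch based on Boris's description, with the other options from his earlier call filled in):

```r
fit <- inla(formula, family = "binomial", data = daten1,
            control.compute = list(dic = TRUE),
            control.inla = list(strategy = "gaussian", int.strategy = "eb",
                                reordering = "amdc"),  # force the amdc fill-reducing reordering
            verbose = TRUE)
```

If available in your INLA version, inla.reorderings() should list the reordering schemes supported by the installation.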