INLA failing without any error on HPC cluster


Scott Burman

Jul 19, 2021, 4:29:24 AM
to R-inla discussion group
I have been scheduling large sets of INLA models on a large compute cluster. Scheduling and resource allocation are done via SLURM (if that matters). I'm running models with 24 cores each, ~1.5GB memory/core. I seem to have squashed all the seg faults and other errors. Now, it just returns "done" within the output file and ends. In the error file, it reports "killed" with no other details.

Here's the contents of my SLURM error file:
Error in inla.inlaprogram.has.crashed() : 
  The inla-program exited with an error. Unless you interupted it yourself, please rerun with verbose=TRUE and check the output carefully.
  If this does not help, please contact the developers at <he...@r-inla.org>.
Calls: inla -> inla.inlaprogram.has.crashed
In addition: Warning message:
In inla.model.properties.generic(inla.trim.family(model), mm[names(mm) ==  :
  Model 'z' in section 'latent' is marked as 'experimental'; changes may appear at any time.
  Use this model with extra care!!! Further warnings are disabled.
Execution halted

I am running INLA with debug=TRUE (and debugging turned on via the environment variable), and I am using PARDISO. PARDISO returns a few errors:
*** PARDISO ERROR(0): not pos.def matrix: 15 eigenvalues are negative.
*** PARDISO ERROR: I will try to work around the problem...

But these do not seem to be catastrophic. The model is a nested model similar to the one in chapter 4 of Gomez-Rubio 2020.

I am at a bit of a loss as the output from SLURM contains absolutely no errors (aside from the above PARDISO ERROR) and no other memory or resource related issues. I'm happy to post any other useful information.

Thanks for your time and help.
Scott

Scott Burman

Jul 19, 2021, 5:56:47 AM
to R-inla discussion group
I somehow got a more helpful error: 
inla.mkl: smtp-pardiso.c:757: GMRFLib_pardiso_solve_core: Assertion `store->pstore->done_with_chol == GMRFLib_TRUE' failed.
What does this mean?

Thanks!

Scott Burman

Jul 19, 2021, 6:00:13 AM
to R-inla discussion group
Is this because I used the Intel MKL library?

Helpdesk

Jul 19, 2021, 10:01:41 AM
to Scott Burman, R-inla discussion group

I think the first thing to check is whether this runs without SLURM in
interactive mode, to see whether the crashes are due to SLURM / running in
batch mode.

the 'not pos.def matrix' error is usually due to an ill-conditioned matrix,
most often caused by a vague model combined with data that are not very
informative. a tip here is to set

control.fixed=list(prec=1,prec.intercept=1)

or similar.

you can also try the newest testing version, which you get with R-4.1, and
then add the option inla.mode="experimental"

let me know
H



--
Håvard Rue
he...@r-inla.org
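
To illustrate why a tighter prior precision on the fixed effects can help with a 'not pos.def matrix' message, here is a minimal base-R sketch. It is not INLA's internal computation: the nearly collinear design matrix `X` and the simplified matrix `t(X) %*% X + prec * I` are assumptions chosen only to show how the prior precision affects conditioning.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-6)  # nearly collinear with x1 (assumed, for illustration)
X <- cbind(x1, x2)

# Simplified stand-in for the precision of the fixed effects:
# data contribution X'X plus a diagonal prior precision.
post_prec <- function(prec) crossprod(X) + prec * diag(ncol(X))

# 2-norm condition numbers: larger means closer to numerically singular
cond_vague <- kappa(post_prec(0.001), exact = TRUE)  # vague prior
cond_tight <- kappa(post_prec(1), exact = TRUE)      # prec = 1, as suggested above

print(c(vague = cond_vague, tight = cond_tight))
```

With the vague prior the smallest eigenvalue is driven almost entirely by the prior precision, so raising it from 0.001 to 1 reduces the condition number by roughly three orders of magnitude in this toy setting.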

Scott Burman

Jul 19, 2021, 3:17:36 PM
to R-inla discussion group
I added 
inla.set.control.fixed.default(prec=1,prec.intercept=1)

Unfortunately, I cannot run anything interactively on the HPC, and I do not know that I have access to any computer with sufficient memory to run this model. I will certainly try though.

It continues to have the same error:
inla.mkl: smtp-pardiso.c:757: GMRFLib_pardiso_solve_core: Assertion `store->pstore->done_with_chol == GMRFLib_TRUE' failed.

Is this an error with PARDISO? Would it be advisable not to use PARDISO?
What does inla.mode="experimental" do?

What other information can I provide? I'm still very much in the beginning phase of my modeling, so my priors are all uninformative, and the design matrices are also quite simple. I will build the final model once I have a running multi-level model.

Thanks!
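
For reference, the two suggestions from the helpdesk reply (a tighter `control.fixed` and `inla.mode="experimental"`) can also be passed directly in the `inla()` call. The sketch below uses simulated data and a trivial one-covariate model, both of which are assumptions for illustration and not the model discussed in this thread. The fit is only attempted when INLA is installed, and `inla.mode` needs a sufficiently recent testing version.

```r
# Simulated (fake) data, for illustration only
set.seed(42)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 0.5 * dat$x + rnorm(n, sd = 0.3)

# Prior precisions for the fixed effects and the intercept, as suggested above
ctrl_fixed <- list(prec = 1, prec.intercept = 1)

# Only fit when INLA is available on this machine
if (requireNamespace("INLA", quietly = TRUE)) {
  fit <- INLA::inla(
    y ~ x,
    family = "gaussian",
    data = dat,
    control.fixed = ctrl_fixed,
    inla.mode = "experimental",  # requires a recent testing version
    verbose = TRUE               # as recommended by the crash message
  )
  print(fit$summary.fixed)
}
```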

Scott Burman

Jul 20, 2021, 6:32:01 PM
to R-inla discussion group
I think I've solved most of the issues I was having before with my simplest nested multi-level model. I'm now adding a temporal component (using rw1/rw2) and am once again hitting the above error.
inla.mkl: smtp-pardiso.c:757: GMRFLib_pardiso_solve_core: Assertion `store->pstore->done_with_chol == GMRFLib_TRUE' failed.
It sounds like this is a memory issue? Oddly, runs that trigger that error never use much memory, and they fail well before memory usage takes off.

I've now added "huge" to my inla call's control.compute. 
I also added inla.set.control.fixed.default(prec=1,prec.intercept=1), which seemed to help until I added the temporal component.
I also saw that elsewhere, you advised using
control.inla=list(strategy="adaptive", int.strategy="eb")
This has also not helped with these errors and segmentation faults.

What debugging info do you need to see?

Finally, not using the Intel MKL library cuts the run time by about 10% and cuts memory use by ~15-20%.

Is there any documentation for all of these control settings, aside from the R help docs? I'm a bit confused about what they all do and when to use them.

Thanks!
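
As a point of reference for the settings mentioned in this message, the sketch below shows where a `rw1` temporal term and the `control.inla` options sit in a call. The data are simulated and the model has a single random-walk term, so this is an assumed toy setup and not the nested multi-level model from the thread. The fit is only attempted when INLA is installed.

```r
# Simulated (fake) time series, for illustration only
set.seed(7)
n_time <- 50
dat <- data.frame(time = seq_len(n_time))
dat$y <- cumsum(rnorm(n_time, sd = 0.2)) + rnorm(n_time, sd = 0.1)

# Intercept plus a first-order random walk over time
form <- y ~ 1 + f(time, model = "rw1")

# Approximation settings quoted in the message above
ctrl_inla <- list(strategy = "adaptive", int.strategy = "eb")

if (requireNamespace("INLA", quietly = TRUE)) {
  library(INLA)
  fit <- inla(form, family = "gaussian", data = dat,
              control.inla = ctrl_inla, verbose = TRUE)
  print(fit$summary.hyperpar)
}
```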

Helpdesk

Jul 21, 2021, 3:32:00 AM
to Scott Burman, R-inla discussion group

it would be easier if I could rerun it here to check. if that is possible,
send me (at he...@r-inla.org) the code and (fake) data.

the error says it met a numerically singular matrix, and there might be
several reasons for this. it's usually just a quick fix to make it work.

you may try the most recent testing version (which requires R-4.1), and
add

inla(..., inla.mode="experimental")

this should use less RAM and be more robust to this kind of issue.
anyhow, please send it so I can re-check

H



Scott Burman

Jul 21, 2021, 1:41:19 PM
to Helpdesk, R-inla discussion group
I'm working on getting R updated and will reach out when I hear back.

Thanks! 

Scott Burman

Jul 22, 2021, 6:15:26 PM
to R-inla discussion group
I've gotten 4.1 installed on our cluster. But since the update to 4.1, I cannot get INLA to install as neither sf nor rgdal are working on Ubuntu due to a versioning issue in libproj. To that same end, I cannot get them installed on my laptop running 4.1 either. 

INLA is installed on R 4.0.2, but I cannot get it installed anywhere on 4.1 in either Arch or Ubuntu.

Thanks

Finn Lindgren

Jul 22, 2021, 7:57:29 PM
to Scott Burman, R-inla discussion group
Hi Scott,

do you have more details on the libproj issue on the laptop? I also run Ubuntu, and have no such issue with R 4.1. My current system libproj versions are

ii  libproj-dev:amd64                                           6.3.1-1                               amd64        Cartographic projection library (development files)
ii  libproj15:amd64                                             6.3.1-1                               amd64        Cartographic projection library

rgdal and sf versions 1.5-23 and 1.0-1.

Finn




--
Finn Lindgren
email: finn.l...@gmail.com

Scott Burman

Jul 22, 2021, 8:01:56 PM
to R-inla discussion group
I've got Arch on my laptop. I believe the issue is that gdal and sf require different versions of libproj. At least that's the issue on my laptop. 

The issue I'm having on our compute cluster revolves around the fact that for whatever reason, it's not seeing libproj at all, even my locally built libproj in my home directory. I tried pointing both rgdal and sf to the directory in R when installing, but both report an error that they fail to see it. I'm not sure if this is a versioning issue or what.

Scott Burman

Jul 22, 2021, 9:17:16 PM
to R-inla discussion group
I have FINALLY gotten a useful error. I apologize for all the emails. I got my runs going in 4.0.2 instead of 3.6.x. I also loaded the cluster's MKL and re-enabled the MKL flag. The result was the following when running my more complex models.

GMRFLib version 3.0-0-snapshot, has recived error no [17]
Reason    : Constraints or its covariance matrix is singular
Function  : GMRFLib_init_problem_store
File      : problem-setup.c
Line      : 928
GitID     : file: problem-setup.c  1f6a39183ef43d8ef33f10ff3f04fd13f8432758 - Mon Feb 22 21:27:50 2021 +0300


Helpdesk

Jul 23, 2021, 7:59:47 AM
to Scott Burman, R-inla discussion group
can you upgrade to the most recent testing version, as there is a fix for
this issue? (it can also be a user input error with too many
constraints, so that the marginals for the constraints are singular)
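
The "too many constraints" case can be checked in base R before fitting. The sketch below is a hypothetical example, not the model from this thread: it assumes a latent effect of length 5 and a constraint matrix `A` of the kind one might supply through `extraconstr = list(A = A, e = e)` in `f()`. One row of `A` is a linear combination of the others, and the identity covariance for the latent effect is also just an assumption for illustration.

```r
# Hypothetical linear constraints A x = e on a latent vector x of length 5
A <- rbind(
  c(1, 1, 1, 1, 1),  # sum-to-zero over all elements
  c(1, 1, 0, 0, 0),  # sum-to-zero over elements 1-2
  c(0, 0, 1, 1, 1)   # sum-to-zero over elements 3-5 (row 1 = row 2 + row 3)
)
e <- rep(0, nrow(A))

# Redundant constraints show up as a rank deficiency
constraint_rank <- qr(A)$rank
is_redundant <- constraint_rank < nrow(A)

# Covariance of A x under an assumed identity covariance for x:
# it is singular exactly when the rows of A are linearly dependent
Sigma <- diag(ncol(A))
cov_Ax <- A %*% Sigma %*% t(A)

print(constraint_rank)
print(is_redundant)
print(det(cov_Ax))
```

Dropping any one of the three rows leaves a full-rank constraint set, which is the kind of quick fix to try if the constraints were specified by hand.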

