Parallel Bootstrapping in lavaan using future.apply on Windows


Christian Arnold

Sep 14, 2025, 8:39:39 AM
to lavaan
Dear lavaan community,

I’d like to share an idea and would appreciate your feedback.

Recently, I came across the future.apply package, which enables parallel processing via multisession on Windows machines. I thought this might be useful for bootstrapping in lavaan, and potentially for other functions that fit models repeatedly, such as permuteMeasEq().

To test this, I ran a small benchmark using a simplified bootstrap function (largely based on bootstrapLavaan) with and without future_sapply. Here's the code:

library(lavaan)
library(future.apply)
plan(multisession)


# Simplified bootstrap function (largely based on bootstrapLavaan)
simpleBoot <- function(object, R, future.sapply) {

  # Extract the slots needed to refit the model on resampled data
  lavdata <- object@Data
  lavmodel <- object@Model
  lavsamplestats <- object@SampleStats
  lavpartable <- object@ParTable
  lavoptions <- object@Options

  # Switch off checks and by-products that are not needed per replication
  lavoptions$check.start <- FALSE
  lavoptions$check.post <- FALSE
  lavoptions$optim.attempts <- 1L
  lavoptions$baseline <- FALSE
  lavoptions$h1 <- FALSE
  lavoptions$loglik <- FALSE
  lavoptions$implied <- FALSE
  lavoptions$store.vcov <- FALSE
  lavoptions$test <- "none"
  lavoptions$se <- "none"

  if (!future.sapply) {
    # Sequential baseline
    out <- sapply(seq_len(R), function(i) {
      coef(doBoot(lavdata, lavmodel, lavsamplestats, lavpartable, lavoptions))
    })
  } else {
    # Parallel version; future.seed = TRUE gives parallel-safe RNG streams
    out <- future_sapply(seq_len(R), function(i) {
      coef(doBoot(lavdata, lavmodel, lavsamplestats, lavpartable, lavoptions))
    }, future.seed = TRUE)
  }
  t(out)
}


doBoot <- function(lavdata, lavmodel, lavsamplestats, lavpartable, lavoptions) {
  # Draw a bootstrap sample (with replacement) within each group
  BOOT.idx <- vector("list", length = lavdata@ngroups)
  dataX <- lavdata@X
  for (g in 1:lavdata@ngroups) {
    boot.idx <- sample.int(nrow(lavdata@X[[g]]), replace = TRUE)
    BOOT.idx[[g]] <- boot.idx
    dataX[[g]] <- dataX[[g]][boot.idx, , drop = FALSE]
  }
  # Rebuild the lavaan data object and sample statistics from the resampled data
  newData <- lav_data_update(
    lavdata = lavdata, newX = dataX,
    BOOT.idx = BOOT.idx,
    lavoptions = lavoptions
  )
  bootSampleStats <- lav_samplestats_from_data(
    lavdata = newData,
    lavoptions = lavoptions
  )
  # Refit the model, reusing the precomputed slots
  lavaan(
    slotData = newData,
    slotModel = lavmodel,
    slotSampleStats = bootSampleStats,
    slotParTable = lavpartable,
    slotOptions = lavoptions
  )
}


model <- "
  # measurement model
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8
  # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + dem60
  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8"

fit <- sem(model, data = PoliticalDemocracy)

R <- 4000L

system.time({simpleBoot(fit, R = R, future.sapply = FALSE)})
system.time({simpleBoot(fit, R = R, future.sapply = TRUE)})


On my relatively slow machine with 4 cores, the classic version (without multisession) took about 120 seconds, while the parallel version completed in roughly 32 seconds.

I realize this may primarily benefit Windows users, but given how many people likely use Windows, I thought it could be worth sharing. Do you think this could be useful for broader lavaan functionality?

Best

Christian

Nanci Quispe

Sep 14, 2025, 8:43:50 AM
to lav...@googlegroups.com
Go to hell.


Shu Fai Cheung (張樹輝)

Sep 14, 2025, 8:53:37 AM
to lavaan
The lavaan package already supports parallel processing in bootstrapping, through the arguments parallel, ncpus, and cl (though I guess not many users use cl, unless working across multiple Windows machines). They are described on the lavOptions help page:


The function bootstrapLavaan() also supports parallel processing.
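
For example, something along these lines should work (a sketch based on the arguments documented in lavOptions; the ncpus value is just an illustration):

bootstrapLavaan(fit, R = 1000L, parallel = "snow", ncpus = 4L)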

However, I am not familiar with the future.apply package. Perhaps it has some advantages over lavaan's built-in support for parallel processing?

-- Shu Fai

Christian Arnold

Sep 14, 2025, 9:01:02 AM
to lav...@googlegroups.com

Thank you for your valuable comment.





Christian Arnold

Sep 14, 2025, 11:04:55 AM
to lavaan
Hi Shu Fai,

Thanks for your helpful feedback!

I ran a few benchmarks to compare different approaches:

system.time({simpleBoot(fit, R = R, future.sapply = FALSE)})
#    User      System     Elapsed
#  118.92       3.43     122.44

system.time({simpleBoot(fit, R = R, future.sapply = TRUE)})
#    User      System     Elapsed
#    0.15       0.01      32.43

system.time({bootstrapLavaan(fit, R = R, parallel = "snow")})
#    User      System     Elapsed
#  120.83       3.03     124.03

system.time({bootstrapLavaan(fit, R = R, parallel = "multicore")})
#    User      System     Elapsed
#  128.13       3.49     132.72


I’m aware that "snow" is the appropriate option for parallel processing on Windows, not "multicore", since "multicore" relies on forking, which is unavailable on Windows. I included "multicore" here just for testing purposes, to see how it behaves (as expected, it didn’t improve performance).

Unless I’ve missed something in the setup, it seems that even with "snow", bootstrapLavaan() doesn’t yield a speed-up, at least on my machine. I didn’t manually set ncpus, since the documentation suggests it defaults to parallel::detectCores() - 1.
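
For completeness, setting ncpus explicitly would look like this (a sketch mirroring the documented default; I have not yet verified whether it changes anything):

bootstrapLavaan(fit, R = R, parallel = "snow",
                ncpus = max(1L, parallel::detectCores() - 1L))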

In contrast, using future_sapply() with multisession on Windows gives a significant speed-up.

Would be great to hear if others have found a workaround or if there's interest in integrating future.apply more directly.

Best,

Christian

Felipe Vieira

Sep 14, 2025, 1:54:08 PM
to lav...@googlegroups.com
Hi Christian, 

I don't think this explains the difference entirely, but I want to check it later: which version of lavaan do you currently have? The development version has "parallel::detectCores() - 2L" as the default ("ncpus = max(1L, parallel::detectCores() - 2L)", to be more specific).

Best, 
Felipe. 



Christian Arnold

Sep 14, 2025, 2:16:24 PM
to lav...@googlegroups.com
Hi Felipe,

Thanks for your feedback. I'm working with the current stable release of lavaan, not a test or development version. Based on my observations (though it's just a hypothesis), I suspect the issue may be that the workers do not have the lavaan package properly loaded in the parallel environment.
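
One way to probe this hypothesis would be to create the cluster manually, attach lavaan on every worker, and pass the cluster to bootstrapLavaan() via its cl argument (a sketch; the worker count of 4 is just an example):

library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, library(lavaan))  # make sure lavaan is attached on each worker
res <- bootstrapLavaan(fit, R = 1000L, parallel = "snow", ncpus = 4L, cl = cl)
stopCluster(cl)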

Shu Fai’s comment was very helpful in confirming that parallel processing should work in principle. That said, I still find that future_sapply() consistently delivers better performance than the built-in parallel options, at least on my system.

Best regards,  

Christian




Jeremy Miles

Sep 14, 2025, 3:39:21 PM
to lav...@googlegroups.com
Hi Christian

Would you mind sharing your code? 

I know that parallel processing is weird on Windows in R, so I wondered whether it would be worth trying on a different OS.

Jeremy



Christian Arnold

Sep 14, 2025, 3:56:39 PM
to lav...@googlegroups.com
Hi Jeremy,

Gladly, but could you clarify what you'd like me to share? I've already posted the full code I used. I haven’t tested the hypothesis systematically, as I haven’t modified the lavaan source code. It’s just an informed guess that future_sapply() and the broader future ecosystem might be more efficient than the traditional parallel approach, and that lavaan may not be properly loading the package within the worker processes (see the source code of bootstrapLavaan()).

Best,  

Christian




Jeremy Miles

Sep 14, 2025, 7:42:25 PM
to lav...@googlegroups.com
D'Oh! Sorry, I missed that. I should have read the whole thread.

I tried it on a 24-core Linux machine and got very similar results, except that future is much faster.

> parallel::detectCores()
[1] 24
> system.time({simpleBoot(fit, R = R, future.sapply = FALSE)})
   user  system elapsed
130.764   1.610 132.631
> system.time({simpleBoot(fit, R = R, future.sapply = TRUE)})
   user  system elapsed
  3.625   0.165  19.286
> system.time({bootstrapLavaan(fit, R = R, parallel = "snow")})
   user  system elapsed
124.672   1.444 126.058
> system.time({bootstrapLavaan(fit, R = R, parallel = "multicore")})
   user  system elapsed
125.083   1.488 126.520

 



Christian Arnold

Sep 15, 2025, 6:28:35 AM
to lavaan
Hi Jeremy, Felipe, and Shu Fai,

Thank you all for your helpful feedback!

Since I didn’t observe any performance gain using "snow", I initially assumed that parallel processing wasn’t available on Windows. @Shu Fai: Thanks for pointing out that parallel processing is indeed possible on Windows.

@Jeremy: Your results seem to suggest that parallel processing might not be working on Linux either. @Felipe: You provided the key insight; my assumptions were clearly incorrect. In the version I use (0.6-19), one must manually set ncpus to activate parallel processing. Based on this, I ran the following benchmarks:

system.time({bootstrapLavaan(fit, R = R)})
#    User      System     Elapsed
#   120.28       3.40      123.75

system.time({bootstrapLavaan(fit, R = R, parallel = "snow", ncpus = max(1L, parallel::detectCores() - 1L))})
#    User      System     Elapsed
#     0.10       0.04       36.30

system.time({bootstrapLavaan(fit, R = R, parallel = "snow", ncpus = max(1L, parallel::detectCores() - 2L))})
#    User      System     Elapsed
#     0.09       0.06       37.61

system.time({simpleBoot(fit, R = R, future.sapply = FALSE)})
#    User      System     Elapsed
#   117.34       3.29      120.70

system.time({simpleBoot(fit, R = R, future.sapply = TRUE)})
#    User      System     Elapsed
#     0.18       0.04       32.15

It’s clear that simpleBoot() is slightly faster than bootstrapLavaan(), because I excluded some overhead that isn’t needed for this test, but the difference is marginal. In this small benchmark, future doesn’t appear to be dramatically more efficient than "snow", though it is still noticeably faster.

Has anyone tried replicating similar benchmarks on Unix-based systems? I’d be curious to know whether future provides a substantial performance gain in those environments, and if the results are comparable to what I observed on Windows.

Thanks again for your insights!

Christian

Shu Fai Cheung (張樹輝)

Sep 15, 2025, 8:44:02 AM
to lavaan
Thanks for starting this thread. I have been exploring ways to improve the speed of bootstrapping in lavaan. It would be great to learn something from this discussion.

There are several separate issues I would like to discuss, so I will address each of them in a separate post.

First, on future.apply.

I am new to this package. Therefore, please correct me if I am wrong about this and related packages.

I took a quick look at the documentation. This is roughly what happens when we call plan(multisession), as far as I understand: it starts a certain number of workers (separate R processes). According to the help page, it calls parallelly::makeClusterPSOCK(). That is, it does what parallel::makeCluster() does, though perhaps with some additional settings (like exporting the environment, discussed below):


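For intuition, the following setups should then be roughly equivalent (a sketch; the worker count of 4 is just an example):

future::plan(future::multisession, workers = 4)  # starts 4 background R sessions
cl <- parallelly::makeClusterPSOCK(4)            # what multisession uses internally
cl2 <- parallel::makeCluster(4, type = "PSOCK")  # the base-R analogue
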
Therefore, I would like to compare future.apply::future_sapply() with the parallelized version of sapply from the parallel package, parallel::parSapply(). I revised simpleBoot() into simpleBoot2(), adding parallel::parSapply() as one of the methods:

simpleBoot2 <- function(object,
                        R,
                        method = c("future.sapply",
                                   "sapply",
                                   "parSapply"),
                        cl = NULL) {
  method <- match.arg(method)

  lavdata <- object@Data
  lavmodel <- object@Model
  lavsamplestats <- object@SampleStats
  lavpartable <- object@ParTable
  lavoptions <- object@Options

  # Switch off checks and by-products, as in simpleBoot()
  lavoptions$check.start <- FALSE
  lavoptions$check.post <- FALSE
  lavoptions$optim.attempts <- 1L
  lavoptions$baseline <- FALSE
  lavoptions$h1 <- FALSE
  lavoptions$loglik <- FALSE
  lavoptions$implied <- FALSE
  lavoptions$store.vcov <- FALSE
  lavoptions$test <- "none"
  lavoptions$se <- "none"

  if (method == "sapply") {
    out <- sapply(seq_len(R), function(i) {
      coef(doBoot(lavdata, lavmodel, lavsamplestats, lavpartable, lavoptions))
    })
  }
  if (method == "parSapply") {
    out <- parallel::parSapply(cl = cl,
                               seq_len(R),
                               function(i) {
      coef(doBoot(lavdata, lavmodel, lavsamplestats, lavpartable, lavoptions))
    })
  }
  if (method == "future.sapply") {
    out <- future_sapply(seq_len(R), function(i) {
      coef(doBoot(lavdata, lavmodel, lavsamplestats, lavpartable, lavoptions))
    }, future.seed = TRUE)
  }
  t(out)
}

I omitted sapply() in the following tests because it obviously must be slow: it uses only one CPU core.

To ensure all comparisons use the same number of workers, I set workers to 8 when calling plan():

plan(multisession,
     workers = 8)

I also used R = 4000L.

This is the result on my computer for simpleBoot() with future.sapply = TRUE:

> system.time({simpleBoot(fit, R = R, future.sapply = TRUE)})
   user  system elapsed
   0.08    0.00    7.22

To compare with parSapply(), I need to create the cluster of workers first and set up the environment. As far as I understand, plan(multisession) will start the cluster right away. You can see the sessions named Rscript when you call plan(multisession), eight such sessions in my case. Therefore, it is fair to exclude the creation of the cluster from the timing.

I am unsure when the objects in the environment (the global environment, by default) will be passed to the workers when using future_sapply(). Therefore, I also included the calls to export these objects in system.time(). These are the results:

> my_cl <- makeCluster(8)
> system.time({
+   clusterExport(cl = my_cl,
+                 c("doBoot", "fit"))
+   clusterEvalQ(cl = my_cl,
+               library(lavaan))
+   simpleBoot2(fit, R = R, method = "parSapply", cl = my_cl)
+ })
   user  system elapsed
   0.03    0.02    6.89
> stopCluster(my_cl)

The difference is minor. I ran the above two tests several times and found no noticeable, consistent differences between them. (I could have used microbenchmark or a similar package, but these casual tests should be sufficient for our purpose.)

The calls to clusterExport() and clusterEvalQ() are necessary for this test, unless I further revise simpleBoot2(). However, I wanted to keep the comparison clean, so I did not do so. Actually, future_sapply() and/or plan(multisession) also need to do these things (exporting objects and loading packages in the workers), although they do them automatically. I am not sure when this happens: when calling future_sapply(), or when the workers are created for multisession.
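
For reference, future.apply discovers globals automatically, but they can also be declared explicitly; a sketch, assuming the future.globals and future.packages arguments behave as documented in future.apply:

out <- future_sapply(seq_len(R), function(i) {
  coef(doBoot(lavdata, lavmodel, lavsamplestats, lavpartable, lavoptions))
}, future.seed = TRUE,
   future.globals = c("doBoot", "lavdata", "lavmodel", "lavsamplestats",
                      "lavpartable", "lavoptions"),
   future.packages = "lavaan")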

I think the purpose of futureverse packages is to make parallel processing accessible, which is an excellent idea:


But being fast may not be its goal (at least not the main one). If speed is our goal, then we can just implement parallel processing in our functions, as I did above using parallel::parSapply(), unless we want to make use of the flexibility of the interface of the futureverse packages.

-- Shu Fai

Shu Fai Cheung (張樹輝)

Sep 15, 2025, 9:32:26 AM
to lavaan
This post is specifically about comparing simpleBoot(), which uses future, with bootstrapLavaan() (and cfa()/sem(), as shown later).

It is not easy to compare simpleBoot() with bootstrapLavaan() in a fair way. As you mentioned, you skipped some of the steps that bootstrapLavaan() performs internally. However, there may be other things tested by bootstrapLavaan() but not by simpleBoot(). You may notice that bootstrapLavaan() reports a lot of nonadmissible solutions, something like this:

lavaan->bootstrapLavaan():
   1268 bootstrap runs resulted in nonadmissible solutions.

There are no such warnings in simpleBoot().

Therefore, I used another model and another dataset from lavaan, to minimize the number of such nonadmissible solutions:

model <-
"
# Only two factors used
visual  =~ x1 + x2 + x3
speed   =~ x7 + x8 + x9
"

fit <- cfa(model, data = HolzingerSwineford1939)

To make the test run longer for this simpler model, I used 8000 replications.

I also set the number of workers manually (2 in the following example) to ensure that both bootstrapLavaan() and future_sapply() use the same number of workers.

> plan(multisession,
+      workers = 2)

> system.time({simpleBoot(fit, R = R, future.sapply = TRUE)})
   user  system elapsed
   0.06    0.00   25.15
> system.time({bootstrapLavaan(fit, R = R, parallel = "snow", ncpus = 2)})
   user  system elapsed
   0.08    0.00   26.14
Warning message:
lavaan->bootstrapLavaan():  
   8 bootstrap runs resulted in nonadmissible solutions.

The difference is minor. Nevertheless, we need to note that the test is not entirely fair:

- The cluster of workers (two in this case) had already been started when we called plan(multisession), so the time for simpleBoot() does not include this step. The creation of the cluster by bootstrapLavaan(), in contrast, was counted towards its time.

- simpleBoot() and bootstrapLavaan() do not do the bootstrapping in exactly the same way; the steps skipped may not be identical.

Nevertheless, if we know in advance exactly what we need and what kinds of models we will encounter, it is indeed possible to write a specialized function like simpleBoot() that is faster than bootstrapLavaan() for that purpose. By design, bootstrapLavaan() is a general-purpose function, and so it needs to do a lot of things to accommodate different possible scenarios.

By the way, I tried to skip things in lavaan as simpleBoot() does, using only lavaan arguments:

> system.time({simpleBoot(fit, R = R, future.sapply = TRUE)})
   user  system elapsed
   0.12    0.00   24.91
> system.time({cfa(model,
+                  data = HolzingerSwineford1939,
+                  se = "bootstrap",
+                  bootstrap = R,
+                  check.post = FALSE,
+                  check.start = FALSE,
+                  test = "none",
+                  baseline = FALSE,
+                  h1 = FALSE,
+                  parallel = "snow",
+                  ncpus = 2)})
   user  system elapsed
   0.07    0.00   24.11

The processing times, interestingly, are pretty similar.

-- Shu Fai


Shu Fai Cheung (張樹輝)

Sep 15, 2025, 9:54:49 AM
to lavaan
This post is about the number of cores/CPUs/workers (I will just say workers below).

For comparing the methods with parallel processing, it is desirable to manually set the number of workers because different functions may have different default values.

plan(multisession) sets the number of workers by parallelly::availableCores().


bootstrapLavaan() sets the number of workers (ncpus) to max(1L, parallel::detectCores() - 2L) or max(1L, parallel::detectCores() - 1L), depending on the version.

On a typical desktop or laptop computer, I believe parallel::detectCores() and parallelly::availableCores() usually give the same result. Therefore, plan(multisession) will have more workers than bootstrapLavaan(), due to the - 1L or - 2L, and so naturally future_sapply() will run faster (but there are exceptions, which I may discuss in another post).
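
You can compare the two defaults on your own machine (the values are machine-dependent):

parallel::detectCores()       # basis of the bootstrapLavaan() default (minus 1L or 2L)
parallelly::availableCores()  # basis of the plan(multisession) default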

That's why I manually set the number of workers in the tests I did in another post.

-- Shu Fai

On Monday, September 15, 2025 at 6:28:35 PM UTC+8 christia...@hhl.de wrote:

Stas Kolenikov

Sep 15, 2025, 11:17:31 AM
to lav...@googlegroups.com
Some random reactions:

1. Please avoid detectCores(): https://www.jottr.org/2022/12/05/avoid-detectcores/ (I've had some oddities with it, like seeing double the number of CPUs or adding CPUs and GPUs, and I have seen people take up all the CPUs on our corporate network computers -- not nice).
2. {future} is quite picky about random number generators and makes a fair amount of fuss stating that the random seeds may not be reproducible. I have not been paying attention to that (my parallel workflows so far have been based on dplyr::split()), but for bootstrapping purposes this probably has to be resolved.
3. I have seen future spend about half a minute spawning workers and passing the data to them (admittedly with a large file, around 4 GB). This will also be a problem if you have a 16 GB machine with 8 threads and a 2 GB data set and you try to run ncpus = detectCores() - 1: it will exhaust memory. So you need to be wise in figuring out what you do and how you do it, and pretty much always use the workers argument of future::plan(multisession). A sketch pulling these three points together follows below.
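
Here is a minimal sketch of the three points above (assuming parallelly::availableCores() and its omit argument behave as documented):

library(future.apply)
# Point 1: availableCores() respects scheduler/container limits and never
# returns NA; omit = 1L leaves one core free for the main session.
n_workers <- parallelly::availableCores(omit = 1L)
# Point 3: always set the number of workers explicitly.
plan(multisession, workers = n_workers)
# Point 2: future.seed gives reproducible, parallel-safe RNG streams.
x <- future_sapply(1:100, function(i) rnorm(1), future.seed = 1234L)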

-- Stas Kolenikov, PhD, PStat (ASA, SSC) 
-- Principal Statistician, NORC @NORCnews
-- Opinions stated in this email are mine only, and do not reflect the position of my employer
-- Social media: @StatStas [ Twitter | mastodon.online ]
-- http://stas.kolenikov.name


