Explained variance for CFA(?)

Rémi Thériault

Aug 30, 2021, 7:14:24 PM
to lavaan
Exploratory Factor Analysis (EFA) with the fa() function from the psych package provides in its output a row called Proportion Var. You can typically get the variance explained by the whole model by summing Proportion Var across factors. For example:

# Fit EFA model
library(psych)
fit <- fa(mydata, nfactors = 4, rotate = "oblimin", scores = "Bartlett", fm = "minres")

# Variance explained by each factor ("Proportion Var" is row 2 of Vaccounted)
fit$Vaccounted[2, ]

# Total variance explained across factors
sum(fit$Vaccounted[2, ])

My coauthor and I would like to compare this number for EFA with its equivalent for Confirmatory Factor Analysis (CFA), but the cfa() function does not seem to provide Proportion Var by default.

So first, is there a statistical reason why the cfa() output does not provide Proportion Var or an equivalent? Should we avoid attempting this comparison, or avoid obtaining explained variance for CFA at all?

Second, I have attempted to calculate this value myself through a custom function (using source 1 and source 2):

mycfa <- function(data, indices) {
  d <- data[indices, ]
  # 'model' is the lavaan model syntax, defined outside this function
  fit <- cfa(model, data = d, estimator = "MLR")
  loadings <- inspect(fit, what = "std")$lambda
  # Proportion of variance explained per factor: sum of squared
  # standardized loadings divided by the number of indicators
  prop.var <- colSums(loadings^2) / nrow(loadings)
  # Total explained variance across the four factors
  sum(prop.var)
}

The idea would be to compare two distributions/histograms of 10,000 bootstrapped samples of the explained variances for the EFA and CFA, respectively, to see if they overlap, and to what extent. Would that approach make sense?
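
On the EFA side, I would bootstrap an analogous statistic function. Here is a sketch reusing the four-factor fa() call from above (myefa is just an illustrative name):

# EFA counterpart to mycfa(): total proportion of variance explained
# ("Proportion Var" is row 2 of fit$Vaccounted)
myefa <- function(data, indices) {
  d <- data[indices, ]
  fit <- fa(d, nfactors = 4, rotate = "oblimin",
            scores = "Bartlett", fm = "minres")
  sum(fit$Vaccounted[2, ])
}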

So, to bootstrap the 10,000 explained variances for CFA, I simply use the mycfa() function above with the boot package:

library(boot)
(vars.boot <- boot(data = mydata, statistic = mycfa, R = 10000))
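
Since each replicate refits the CFA, 10,000 replicates can take a while; boot's built-in parallel support may help (a sketch; note that parallel = "multicore" is not available on Windows, where "snow" is the alternative):

# Optional: spread the replicates across cores
vars.boot <- boot(data = mydata, statistic = mycfa, R = 10000,
                  parallel = "multicore", ncpus = 4)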

The output, however, includes 275 impossible values (out of 10,000), i.e., values greater than 1 (explained variance should not exceed 1). Is there an explanation for this? It is not a large proportion, but it does seem like a problem. Is the function at fault, the bootstrapping, or the combination of the two?
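
For reference, this is how I count and inspect those replicates (vars.boot$t holds the 10,000 bootstrapped statistics):

# How many bootstrapped explained variances exceed 1?
sum(vars.boot$t > 1)

# Look at the offending values
vars.boot$t[vars.boot$t > 1]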

While running my model, I do get this warning:
In lav_object_post_check(object) :
  lavaan WARNING: some estimated ov variances are negative
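
For what it's worth, I can locate which residual variances are negative (the diagonal of the estimated theta matrix holds the observed-variable residual variances):

# Locate the negative residual variances behind the warning
theta <- lavInspect(fit, "est")$theta
diag(theta)[diag(theta) < 0]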

But I am not sure what to do with this or what the implications are. Could this be related to the impossible bootstrapped values? Thank you very much.

Related/relevant conversations:

Terrence Jorgensen

Aug 31, 2021, 4:15:38 AM
to lavaan
is there a statistical reason why the cfa() output does not provide Proportion Var or an equivalent?

There is a theoretical reason. The proportion of variance across indicators that a factor explains was developed as a justification for whether to retain that factor in an EFA. In CFA, your justification for retaining a factor is the reason you specified it to begin with: it is the factor you meant to measure with those indicators.
 
Should we avoid attempting this comparison, or avoid obtaining explained variance for CFA at all?

As I said in your second link, you can't clearly delineate "this" from "that" variance being explained by "this" or "that" factor when the factors are correlated.  I think it is already a muddy interpretation in EFA when an oblique rotation is used.
 
I have attempted to calculate this value myself through a custom function 

Looks nice, I think those are the right formulas.  See also Jeremy's comments in this thread:

The idea would be to compare two distributions/histograms of 10,000 bootstrapped samples of the explained variances for the EFA and CFA, respectively, to see if they overlap, and to what extent. Would that approach make sense?

I see nothing wrong with that exploration, but what is the purpose? EFA will explain more variance because it estimates more loadings. Are you looking for a way to show that adding cross-loadings to a CFA doesn't explain much more variance? If your goal is to justify being satisfied with the approximate fit of the CFA, I don't think this provides a justification, because the EFA and CFA have the same functional form: they are just less and more restrictive versions of the same model. Your model can fail in ways beyond not estimating enough parameters.

The output, however, includes 275 impossible values (out of 10,000), i.e., values greater than 1 (explained variance should not exceed 1). Is there an explanation for this?

You are calculating the sum of squared loadings. That can exceed 1 when standardized loadings are > 1 or (sometimes equivalently) when residual variances are < 0, i.e., Heywood cases, which can occur simply due to sampling error: https://doi.org/10.1177/0049124112442138
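
One way to handle this in the bootstrap is to return NA whenever the solution is inadmissible, so those replicates can be excluded afterward. A sketch of your function with that guard added (lavInspect(fit, "post.check") returns TRUE when the estimates pass the checks behind the warning you saw; mycfa.checked is just an illustrative name):

# Drop replicates with inadmissible solutions (e.g., Heywood cases)
mycfa.checked <- function(data, indices) {
  d <- data[indices, ]
  fit <- try(cfa(model, data = d, estimator = "MLR"), silent = TRUE)
  if (inherits(fit, "try-error")) return(NA)      # nonconvergent replicate
  if (!lavInspect(fit, "post.check")) return(NA)  # inadmissible solution
  loadings <- lavInspect(fit, what = "std")$lambda
  sum(loadings^2) / nrow(loadings)                # total proportion of variance
}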

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Mauricio Garnier-Villarreal

Aug 31, 2021, 9:01:01 AM
to lavaan
Rémi

I have never particularly liked those "total explained variance" terms. They feel like they make more sense for PCA, which does not care about where specifically the information is being explained, so the concept does not make sense to me in factor analysis. I think item R2 values make more sense, because you are looking at how well the model works for specific items.
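
In lavaan, the item R2 values are easy to pull from a fitted model, for example:

# R-squared for each observed variable
lavInspect(fit, "rsquare")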

Rémi Thériault

Sep 3, 2021, 4:32:12 PM
to lav...@googlegroups.com
Thank you all very much for your answers; this is very helpful.

I see nothing wrong with that exploration, but what is the purpose? EFA will explain more variance because it estimates more loadings. Are you looking for a way to show that adding cross-loadings to a CFA doesn't explain much more variance? If your goal is to justify being satisfied with the approximate fit of the CFA, I don't think this provides a justification, because the EFA and CFA have the same functional form: they are just less and more restrictive versions of the same model. Your model can fail in ways beyond not estimating enough parameters.

My understanding is that the purpose would be to show that one model is "better" than the other, not only in terms of fit indices (e.g., BIC), but also in terms of total variance explained. The bootstrapped histograms would visually show that the explained variances from the EFA and CFA do not overlap, and therefore that one is clearly or substantially "better" than the other. The context is a CFA attempting to validate/replicate a previously proposed structure that actually provides a poor fit, which we would compare to a new EFA of our own with a much better fit.

Thanks again for clarifying these points.