R-square and R-square difference as parameters


Shu Fai Cheung (張樹輝)

Jan 15, 2023, 8:32:18 AM
to lavaan
Hi lavaan users,

I recall that users have asked questions about R-squares and differences in R-squares. I know lavaan can print R-squares directly in the output. However,

(a) they are not parameters and we cannot use them for user-defined parameters, and
(b) no p-values are available for them.

A trick came to my mind. I am not sure whether it has been mentioned before; it may well have been, because R-square is a frequently asked topic here. However, I searched and have not found this solution in previous posts (sorry if I overlooked related ones). So, I share it here (maybe again?).

This trick is really a "trick"/workaround, doing something in an unconventional way. Therefore, I share it here *not* to claim that it must work. Instead, I share it here to seek comments on two questions:

(a) Does it really work? Did I overlook something?
(b) If it does *sometimes* work, when does it *not* work?

This is an example:

library(lavaan)
#> This is lavaan 0.6-13
#> lavaan is FREE software! Please report any bugs.
set.seed(870432)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- .3 * x1 + .3 * x2 + sqrt(1 - .3^2 - .3^2) * rnorm(n)
y2 <- .0 * x1 + .1 * x2 + sqrt(1 - .0^2 - .1^2) * rnorm(n)
dat <- data.frame(x1, x2, y1, y2)

mod <-
"
y1 ~ x1 + x2
y2 ~ x1 + x2
y1 ~~ ev1 * y1
y2 ~~ ev2 * y2
y1rsq := 1 - ev1
y2rsq := 1 - ev2
rsqdiff := y1rsq - y2rsq
"

fit <- lavaan::sem(mod, dat)
# y1rsq, y2rsq, and rsqdiff in the unstandardized solution
# are not interpretable as R-squares; they can be negative.
parameterEstimates(fit, rsquare = TRUE)[11:15, ]
#>        lhs op         rhs   label    est    se      z pvalue ci.lower ci.upper
#> 11   y1rsq :=       1-ev1   y1rsq  0.214 0.079  2.725  0.006    0.060    0.368
#> 12   y2rsq :=       1-ev2   y2rsq -0.008 0.101 -0.075  0.940   -0.205    0.190
#> 13 rsqdiff := y1rsq-y2rsq rsqdiff  0.222 0.127  1.743  0.081   -0.028    0.471
#> 14      y1 r2          y1          0.113    NA     NA     NA       NA       NA
#> 15      y2 r2          y2          0.012    NA     NA     NA       NA       NA
# y1rsq, y2rsq, and rsqdiff in the standardized solution
# are indeed the R-squares.
standardizedSolution(fit)[11:13, ]
#>        lhs op         rhs   label est.std    se     z pvalue ci.lower ci.upper
#> 11   y1rsq :=       1-ev1   y1rsq   0.113 0.041 2.762  0.006    0.033    0.194
#> 12   y2rsq :=       1-ev2   y2rsq   0.012 0.015 0.798  0.425   -0.018    0.043
#> 13 rsqdiff := y1rsq-y2rsq rsqdiff   0.101 0.043 2.364  0.018    0.017    0.184

# Verify the results
est <- parameterEstimates(fit, rsquare = TRUE)
std <- standardizedSolution(fit)
# Compare the R-squares
est[14:15, "est"]
#> [1] 0.11322392 0.01236012
std[11:12, "est.std"]
#> [1] 0.11322392 0.01236012
# Compare the R-square difference
est[14, "est"] - est[15, "est"]
#> [1] 0.1008638
std[13, "est.std"]
#> [1] 0.1008638

To include the R-square of a variable as a user-defined parameter, I labelled the error variance and then defined 1 - (error variance). In the unstandardized solution, this user-defined parameter is not meaningful and can even be negative.

However, in the standardized solution (with std.all, the default), this variable is standardized. Therefore, 1 - (error variance after standardization) becomes the R-square.

With this trick, the difference between two R-squares can itself become a user-defined parameter, computed as the difference of the two R-squares. Again, we need to look at the standardized solution.

WARNING: There are two issues with this approach:

1) When an R-square is close to zero or the sample size is small, the delta-method SE, CI, and p-value in the standardized solution may not be appropriate, given the sampling distribution of the R-square.

2) Even with se = "boot", the CIs in the standardized solution are not bootstrap CIs (they are delta-method CIs based on the bootstrap VCOV). Actually, even *if* the CIs in the standardized solution *were* bootstrap CIs, they might still not be appropriate for the R-square in some cases.

Despite these two issues, it may still be useful to be able to use R-squares as user-defined parameters.

May I have your comments on this trick? Any pitfall?

-- Shu Fai

Shu Fai Cheung (張樹輝)

Jan 15, 2023, 8:56:22 AM
to lavaan
Sorry for a typo; the sentence should have read:

"However, in the standardized solution (with std.all, the default), this variable is standardized. Therefore, 1 - (error variance after standardization) becomes the R-square."

-- Shu Fai

Shu Fai Cheung (張樹輝)

Jan 15, 2023, 10:44:30 AM
to lavaan
A quick follow-up:

The difference in R-squares can also be defined directly from error variances:

rsqdiff := y1rsq - y2rsq
rsqdiff2 := ev2 - ev1

This is because

y1rsq := 1 - ev1
y2rsq := 1 - ev2
y1rsq - y2rsq = (1 - ev1) - (1 - ev2) = -ev1 + ev2 = ev2 - ev1

However, ev2 - ev1 is not as easy to read as y1rsq - y2rsq.

Note that the difference has the intended meaning only in the standardized solution.
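As a quick numeric sanity check of the identity above (with arbitrary illustrative values, not taken from any fitted model):

```r
# Numeric check of the identity (1 - ev1) - (1 - ev2) = ev2 - ev1,
# using arbitrary (hypothetical) standardized error variances.
ev1 <- 0.887
ev2 <- 0.988
rsqdiff  <- (1 - ev1) - (1 - ev2)
rsqdiff2 <- ev2 - ev1
all.equal(rsqdiff, rsqdiff2)
#> [1] TRUE
```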

-- Shu Fai

Christian Arnold

Jan 15, 2023, 12:58:58 PM
to lav...@googlegroups.com
Hi Shu Fai,

brilliant, thank you very much for this. However, if I understand the approach correctly, it may happen, for example, that the delta-method CI contains zero, which is hard for me to interpret.

Best

Christian

P.S.: Here is an example:

library(lavaan)

pop.model <- "
f2 ~ -0.9 * f1
f3 ~ 0.2 * f2 + 0.1 * f1
"

set.seed(870432)
n <- 200
data <- simulateData(pop.model, sample.nobs = n)


model <- "
f2 ~ f1
f3 ~ f2 + f1

f2 ~~ e1 * f2
f3 ~~ e2 * f3

f2.rsq := 1 - e1
f3.rsq := 1 - e2
"

fit <- sem(model, data)

lavInspect(fit, "rsquare")
standardizedSolution(fit)

     lhs op  rhs  label est.std    se       z pvalue ci.lower ci.upper
1     f2  ~   f1         -0.730 0.028 -25.766  0.000   -0.785   -0.674
2     f3  ~   f2          0.142 0.102   1.391  0.164   -0.058    0.342
3     f3  ~   f1          0.010 0.102   0.094  0.925   -0.191    0.210
4     f2 ~~   f2     e1   0.468 0.041  11.313  0.000    0.387    0.549
5     f3 ~~   f3     e2   0.982 0.019  52.468  0.000    0.945    1.018
6     f1 ~~   f1          1.000 0.000      NA     NA    1.000    1.000
7 f2.rsq := 1-e1 f2.rsq   0.532 0.041  12.883  0.000    0.451    0.613
8 f3.rsq := 1-e2 f3.rsq   0.018 0.019   0.973  0.331   -0.018    0.055

Why not just use, as usual, the bootstrap and the typical CIs (perc, bc)? Example:

get.rsq <- function(x) lavInspect(x, "rsquare")
boot <- bootstrapLavaan(fit, R = 1000, FUN = get.rsq, verbose = TRUE)
alpha <- c(0.025, 0.975)

lavInspect(fit, "rsquare")
         
c(
  ci.lower = as.numeric(quantile(boot[,"f2"], alpha[1], type = 6)),
  ci.upper = as.numeric(quantile(boot[,"f2"], alpha[2], type = 6))
)

c(
  ci.lower = as.numeric(quantile(boot[,"f3"], alpha[1], type = 6)),
  ci.upper = as.numeric(quantile(boot[,"f3"], alpha[2], type = 6))
)

 ci.lower  ci.upper
0.4437990 0.6140211

   ci.lower    ci.upper
0.001508923 0.079317297
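The same percentile computation also gives a CI for the *difference* between two R-squares, by taking quantiles of the difference of the bootstrap replicates. A self-contained base-R sketch of the idea, using lm() in place of the SEM and hypothetical simulated data, so it runs without lavaan:

```r
# Bootstrap percentile CI for the difference between two R-squares,
# sketched with lm() so the idea is self-contained. The data-generating
# model and variable names are hypothetical.
set.seed(870432)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y1 <- .3 * x1 + .3 * x2 + rnorm(n, sd = sqrt(1 - .18))
y2 <- .1 * x2 + rnorm(n, sd = sqrt(1 - .01))
dat <- data.frame(x1, x2, y1, y2)

# R-square difference from two regressions fitted to one (resampled) data set
rsq_diff <- function(d) {
  summary(lm(y1 ~ x1 + x2, data = d))$r.squared -
    summary(lm(y2 ~ x1 + x2, data = d))$r.squared
}

boot_diff <- replicate(1000, rsq_diff(dat[sample(n, replace = TRUE), ]))
quantile(boot_diff, c(.025, .975), type = 6)
```

With bootstrapLavaan() as above, the analogous computation would be quantile(boot[, "f2"] - boot[, "f3"], alpha, type = 6) on the returned matrix.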





Shu Fai Cheung (張樹輝)

Jan 15, 2023, 9:35:17 PM
to lavaan
You are right. bootstrapLavaan() can already be used. I am interested in having the R-squares right in the parameter table, without the need to call another function; we may then do something else with them. (How to do that properly needs further examination.)

By the way, it can be problematic to use a bootstrap CI for the R-square. The R-square computed that way is non-negative. If the population R-square is zero, the coverage probability of the bootstrap CI is necessarily 0%, so the bootstrap CI cannot be used to test that null hypothesis. Even if the population R-square is not zero, when the sampling variance is large relative to the population R-square, the coverage probability of the bootstrap CI can still fall below the nominal level. This paper is related, though it concerns regression:

Ohtani, K. (2000). Bootstrapping R2 and adjusted R2 in regression analysis. Economic Modelling, 17(4), 473–483. https://doi.org/10.1016/S0264-9993(99)00034-6

Interestingly, although the adjusted R-square can be negative and may "look" inconvenient, the bootstrap CI can work better for the adjusted R-square than for the (unadjusted) R-square in terms of coverage probability.
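The zero-population case is easy to illustrate in base R (lm() stands in for the SEM here, with simulated data whose population R-square is exactly zero):

```r
# Every bootstrap replicate of the (regression) R-square is strictly
# positive, so a percentile CI can never cover the true value 0.
set.seed(1)
n <- 200
d0 <- data.frame(x = rnorm(n), y = rnorm(n))  # population R-square = 0
boot_rsq <- replicate(1000, {
  di <- d0[sample(n, replace = TRUE), ]
  summary(lm(y ~ x, data = di))$r.squared
})
min(boot_rsq)                                 # strictly above 0
quantile(boot_rsq, c(.025, .975), type = 6)   # lower limit excludes 0
```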

That said, I am not advocating the unconditional use of the delta method CI in this case. This is an issue that needs further discussion.

Moreover, the paper by Ohtani is on regression. In SEM, the sample R-square can be negative (unless we impose some constraints), like the adjusted R-square.

An interesting related issue is whether a negative lower limit of the CI for an R-square is a problem. I recall a thread on SEMNET about a related issue last summer, though that discussion was mainly about the point estimate (a negative variance), not the CI.

A related paper:

Savalei, V., & Kolenikov, S. (2008). Constrained versus unconstrained estimation in structural equation modeling. Psychological Methods, 13(2), 150–170. https://doi.org/10.1037/1082-989X.13.2.150

Lastly, the issue with the negative lower limit is not due to the trick. It can happen even if we compute the R-square directly using ":=":

library(lavaan)
#> This is lavaan 0.6-13
#> lavaan is FREE software! Please report any bugs.
set.seed(870432)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- .3 * x1 + .3 * x2 + sqrt(1 - .3^2 - .3^2) * rnorm(n)
y2 <- .0 * x1 + .1 * x2 + sqrt(1 - .0^2 - .1^2) * rnorm(n)
dat <- data.frame(x1, x2, y1, y2)

mod <-
"
y1 ~ x1 + x2
y2 ~ b1*x1 + b2*x2

y1 ~~ ev1 * y1
y2 ~~ ev2 * y2
y1rsq := 1 - ev1
y2rsq := 1 - ev2
x1 ~~ vx1*x1
x2 ~~ vx2*x2
x1 ~~ vx1x2*x2
# vx1x2 labels the covariance of x1 and x2, so it enters the cross term directly
vy2 := ev2 + (b1^2)*vx1 + (b2^2)*vx2 + 2*b1*b2*vx1x2
y2rsq2 := (vy2 - ev2) / vy2

"

fit <- lavaan::sem(mod, dat)
# Direct computation
parameterEstimates(fit, rsquare = TRUE)[14, -3]
#>       lhs op  label   est    se     z pvalue ci.lower ci.upper
#> 14 y2rsq2 := y2rsq2 0.012 0.015 0.796  0.426   -0.018    0.043
# Based on the "trick"
standardizedSolution(fit)[12, -3]
#>      lhs op label est.std    se     z pvalue ci.lower ci.upper
#> 12 y2rsq := y2rsq   0.012 0.016 0.796  0.426   -0.018    0.043

The SEs are slightly different, which may be due to numerical imprecision in the computation. In any case, the lower limits of the delta-method CIs are negative in both cases.

-- Shu Fai

P.S.: Sorry if the code is not displayed in a fixed-width font (Courier New). I did set the font, but Google Groups sometimes discards the formatting. I don't know why.

Shu Fai Cheung (張樹輝)

Jan 15, 2023, 9:38:57 PM
to lavaan
Sorry, I just noticed that I contradicted myself:

I wrote:
> The R-square computed that way is non-negative.

Then I wrote:
> In SEM, the sample R-square can be negative (unless we impose some constraints), like the adjusted R-square.

So, the appropriateness of the bootstrap CI for the R-square may depend on the presence or absence of constraints in the estimation (see Savalei & Kolenikov).

-- Shu Fai

Shu Fai Cheung (張樹輝)

Jan 15, 2023, 9:58:00 PM
to lavaan
In OpenMx, we have a lot of flexibility to define user parameters in the model: we can use labels, but we can also use the model matrices directly. However, in lavaan we can currently only form user-defined parameters from labels. (There is a workaround, but it is not officially supported and so needs to be used with caution.) To do something more complicated, we need additional steps (e.g., bootstrapLavaan()). That's why I wonder whether there are ways to have such parameters, e.g., R-squares, defined right inside the model syntax, without doing the computation as in my second example, which becomes exceedingly complicated as the number of predictors increases or when one of the predictors is a mediator. That's the motivation behind the trick.

-- Shu Fai

Ivan Jacob Pesigan

Jan 16, 2023, 12:18:41 AM
to lav...@googlegroups.com
I think the trick is useful insofar as it makes specifying the R-squared convenient. Yes, we can always derive a new parameter as a function of other parameters in the model, but that gets complicated quickly even in very simple models. Here is my take on the trick, using the Monte Carlo method implemented in the semmcci package, which builds on lavaan output. The standardized solution (`semmcci::MCStd()`) makes good use of the trick (see Example 1 below). Similar to the other approaches, the issue with the Monte Carlo method is the coverage limitation at the lower boundary. In Example 2, for a 95% CI (2.5% = 0.0001 and 97.5% = 0.0088), zero is not covered even though the population value is zero.

Jek

# Example 1
set.seed(42)
n <- 1000
a <- 0.50
b <- 0.50
cp <- 0.25
s2_em <- 1 - a^2
s2_ey <- 1 - cp^2 - a^2 * b^2 - b^2 * s2_em - 2 * cp * a * b
rsqm <- 1 - s2_em
rsqy <- 1 - s2_ey
em <- rnorm(n = n, mean = 0, sd = sqrt(s2_em))
ey <- rnorm(n = n, mean = 0, sd = sqrt(s2_ey))
X <- rnorm(n = n)
M <- a * X + em
Y <- cp * X + b * M + ey
df <- data.frame(X, M, Y)

model <- "
  Y ~ cp * X + b * M
  M ~ a * X
  M ~~ sm * M
  Y ~~ sy * Y
  indirect := a * b
  direct := cp
  total := cp + (a * b)
  rsqm := 1 - sm
  rsqy := 1 - sy
"

fit <- lavaan::sem(model, df)
mc_unstd <- semmcci::MC(fit)
semmcci::MCStd(mc_unstd)

#> Standardized Monte Carlo Confidence Intervals
#>             est     se     R  0.05%   0.5%   2.5%  97.5%  99.5% 99.95%
#> cp       0.2422 0.0267 20000 0.1519 0.1724 0.1893 0.2934 0.3104 0.3258
#> b        0.5123 0.0247 20000 0.4277 0.4475 0.4633 0.5600 0.5752 0.5900
#> a        0.4963 0.0221 20000 0.4223 0.4372 0.4522 0.5388 0.5514 0.5646
#> sm       0.7537 0.0219 20000 0.6812 0.6960 0.7097 0.7955 0.8089 0.8217
#> sy       0.5558 0.0227 20000 0.4818 0.4969 0.5111 0.6005 0.6133 0.6282
#> indirect 0.2542 0.0169 20000 0.2005 0.2119 0.2213 0.2873 0.2990 0.3105
#> direct   0.2422 0.0267 20000 0.1519 0.1724 0.1893 0.2934 0.3104 0.3258
#> total    0.4964 0.0225 20000 0.4192 0.4353 0.4509 0.5388 0.5514 0.5650
#> rsqm     0.2463 0.0219 20000 0.1783 0.1911 0.2045 0.2903 0.3040 0.3188
#> rsqy     0.4442 0.0227 20000 0.3718 0.3867 0.3995 0.4889 0.5031 0.5182

# Example 2
set.seed(42)
n <- 1000
a <- 0
b <- 0
cp <- 0
s2_em <- 1 - a^2
s2_ey <- 1 - cp^2 - a^2 * b^2 - b^2 * s2_em - 2 * cp * a * b
rsqm <- 1 - s2_em
rsqy <- 1 - s2_ey
em <- rnorm(n = n, mean = 0, sd = sqrt(s2_em))
ey <- rnorm(n = n, mean = 0, sd = sqrt(s2_ey))
X <- rnorm(n = n)
M <- a * X + em
Y <- cp * X + b * M + ey
df <- data.frame(X, M, Y)

fit <- lavaan::sem(model, df)
mc_unstd <- semmcci::MC(fit)
semmcci::MCStd(mc_unstd)

#> Standardized Monte Carlo Confidence Intervals
#>              est     se     R   0.05%    0.5%    2.5%  97.5%  99.5% 99.95%
#> cp       -0.0175 0.0318 20000 -0.1160 -0.0989 -0.0799 0.0439 0.0633 0.0860
#> b         0.0096 0.0313 20000 -0.0939 -0.0694 -0.0521 0.0713 0.0898 0.1086
#> a        -0.0214 0.0316 20000 -0.1259 -0.1029 -0.0832 0.0412 0.0607 0.0818
#> sm        0.9995 0.0019 20000  0.9841  0.9894  0.9930 1.0000 1.0000 1.0000
#> sy        0.9996 0.0024 20000  0.9844  0.9880  0.9912 0.9999 1.0000 1.0000
#> indirect -0.0002 0.0012 20000 -0.0076 -0.0051 -0.0032 0.0022 0.0037 0.0055
#> direct   -0.0175 0.0318 20000 -0.1160 -0.0989 -0.0799 0.0439 0.0633 0.0860
#> total    -0.0177 0.0318 20000 -0.1162 -0.0991 -0.0805 0.0437 0.0632 0.0858
#> rsqm      0.0005 0.0019 20000  0.0000  0.0000  0.0000 0.0070 0.0106 0.0159
#> rsqy      0.0004 0.0024 20000  0.0000  0.0000  0.0001 0.0088 0.0120 0.0156

Ivan Jacob Pesigan

Jan 16, 2023, 12:24:58 AM
to lav...@googlegroups.com
As a follow-up on Example 2, I ran a small-scale simulation and the coverage for `rsqy` is poor. I reckon having poor coverage on the boundary is similar to the bootstrap approach.
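A base-R sketch of such a coverage check, with lm() in place of the SEM and deliberately small replication numbers (an illustration of the boundary problem, not a proper simulation study):

```r
# Coverage of the percentile bootstrap CI for R-square when the
# population R-square is 0. Because every bootstrap replicate of the
# sample R-square is strictly positive, the CI never covers 0.
set.seed(123)
n <- 100; nsim <- 50; B <- 200
covered <- logical(nsim)
for (i in seq_len(nsim)) {
  d <- data.frame(x = rnorm(n), y = rnorm(n))  # population R-square = 0
  rsq <- replicate(B, {
    di <- d[sample(n, replace = TRUE), ]
    summary(lm(y ~ x, data = di))$r.squared
  })
  ci <- quantile(rsq, c(.025, .975), type = 6)
  covered[i] <- ci[1] <= 0 && 0 <= ci[2]
}
mean(covered)  # empirical coverage of the true value 0
#> [1] 0
```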

Jek