Distinguishing lavaan(), sem(), and cfa()


Pierce Ekstrom

Jun 14, 2025, 7:18:39 AM
to lavaan
I'm getting started learning Lavaan, and I made a naive mistake today in using the lavaan() function when I should have used the sem() function. I'm trying to understand why my mistake produced the effects it did.

The mistake I made
Reading through the tutorial here, I tried using lavaan to run a simple regression using my own data. The code I used was of the form:
testmodel <- '
  y ~ x1 + x2
  y ~ 1
'

summary(lavaan(testmodel, data = data))

y, x1, and x2 are all observed variables. data is a moderately large dataset (data frame) that includes missing values.

My model above (and several more complicated attempts) failed, with four warnings each time:
Warning messages: 1: lavaan->lav_model_estimate(): initial model-implied matrix (Sigma) is not positive definite; check your model and/or starting parameters.
and a fifth warning reporting that estimation failed:
5: lavaan->lav_lavaan_step11_estoptim():
   Model estimation FAILED! Returning starting values.


Finally, I figured out that I should have used sem() where I used lavaan() above. 

My questions are:
1) What caused the lavaan() command above to fail where sem() succeeded? Was it trying to treat y, x1, and/or x2 as latent variables? 

2) More generally, I'd like to know whether there is any straightforward heuristic on when it is best to use the sem(), cfa(), and lavaan() functions, or if the only answer/solution is to slowly learn and memorize their respective arguments and defaults. I know all three functions have documentation, but I do not yet feel expert enough to compare their ins-and-outs with that documentation alone.

3) I know this is a beginners' question, and I would be happy to be referred to any existing threads or documentation on this topic. I didn't find any searching in the group, but that may be because "sem," "cfa," "lavaan," and "function" are terms that come up quite a bit around here.

Eventually, I hope to estimate cross-lagged panel models with random intercepts, following Mulder & Hamaker (2020). They seem to rely on the lavaan() and cfa() functions for these analyses, so I would like to get a clearer idea of how those compare before things get even further over my head.

I am aware of a couple relevant excerpts from the tutorial:
 ---------
"In this example, we have used the cfa() function. Other functions in the lavaan package are sem() and growth() for fitting full structural equation models and growth curve models respectively. All three functions are so-called user-friendly functions, in the sense that they take care of many details automatically, so we can keep the model syntax simple and concise. If you wish to fit non-standard models or if you don’t like the idea that things are done for you automatically, you can use the lower-level function lavaan() instead, where you have full control. "

"The function sem() is very similar to the function cfa(). In fact, the two functions are currently almost identical, but this may change in the future."
------------
Thanks for any help!

Edward Rigdon

Jun 14, 2025, 11:47:18 AM
to lav...@googlegroups.com
The cfa() and sem() functions impose defaults while the lavaan function minimizes defaults. This slide from a presentation summarizes the defaults associated with cfa() and sem() but not lavaan(). So the problem is that, with cfa() and sem(), the actual specification of the model is much more elaborate than your model syntax, but with lavaan(), your model syntax is closer to being the full extent of model specification.
Broadly, I would use cfa() or sem() unless (a) their defaults were somehow interfering with your model specification, or (b) you enjoy pain.
cfa and sem vs lavaan.pptx
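
To make the point about defaults concrete, here is a minimal sketch adapted from the example in the lavaan tutorial, assuming the built-in HolzingerSwineford1939 data. The three auto.* arguments are the ones this particular model needs; other models may need more of them.

```r
library(lavaan)

# Three-factor CFA from the lavaan tutorial (built-in dataset)
HS.model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# cfa() applies its defaults behind the scenes
fit_cfa <- cfa(HS.model, data = HolzingerSwineford1939)

# lavaan() needs the same defaults switched on explicitly
fit_lav <- lavaan(HS.model, data = HolzingerSwineford1939,
                  auto.var = TRUE,        # free the (residual) variances
                  auto.fix.first = TRUE,  # fix the first loading of each factor to 1
                  auto.cov.lv.x = TRUE)   # free the covariances among the factors

# Both calls should describe the same model
length(coef(fit_cfa))
length(coef(fit_lav))
fitMeasures(fit_cfa, "chisq")
fitMeasures(fit_lav, "chisq")
```

Dropping any of the three arguments from the lavaan() call changes the model that is fitted, which is the sense in which the syntax alone is "closer to being the full extent of model specification".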

Jeremy Miles

Jun 14, 2025, 2:13:32 PM
to lav...@googlegroups.com
I must enjoy pain. :)

I'm always a bit anxious about the defaults, so I prefer to use lavaan() and specify everything; that way I know what is happening. (I also don't use things like the growth() function.)


Pierce Ekstrom

Jun 16, 2025, 5:02:37 PM
to lavaan
Thank you both!

Reviewing the slide that Edward sent, I think the specific problem with my initial code:

testmodel <- '
  y ~ x1 + x2
  y ~ 1
'
summary(lavaan(testmodel, data = data))

was that it did not estimate y's residual variance? 

As I mentioned, substituting sem() for lavaan in the above caused an (apparently) reasonable model to be estimated. And as Edward pointed out, one of the defaults sem imposes is "every residual or error term gets a variance."

I was able to get a result that appears identical to that of sem(testmodel) when I instead wrote:
testmodel <- '
y ~ x1 + x2
y ~ 1
y ~~ y
'
summary(lavaan(testmodel, data = data))


Because, if I understand correctly, adding y~~y instructed Lavaan to estimate the (residual) variance for y...?

Shu Fai Cheung (張樹輝)

Jun 16, 2025, 7:02:19 PM
to lavaan
I can't speak for the developers of lavaan, but I would like to share my view from developing a package.

The function lavaan(), I believe, is intended to be the main (final) function for model fitting. That is, it is the "real" function doing this job. The "final stop" before starting the SEM analysis. It also has some default values for some arguments (like many R functions).

However, there are cases in which some default values may not be appropriate for some models. It would be inconvenient to keep changing these default values every time we fit those models. Therefore, some "wrappers" were developed, such as sem(), cfa(), and growth(). These wrappers have default values for some arguments that differ from those of lavaan() (e.g., see the help page of sem(): https://rdrr.io/cran/lavaan/man/sem.html). They are useful because we don't have to keep changing those arguments when calling lavaan(). They also simplify things: what we would otherwise have to specify manually in the model syntax when using lavaan() directly is handled automatically by the default values of the wrappers.

The lavaan() function may or may not have been intended for everyday use when it was first written. I am not sure, because I didn't use lavaan when it was first released. However, for the last few years, I have never encountered a case in which I needed to teach students to use this function. The wrappers are intended to prevent users from making mistakes when fitting common models, such as forgetting to add something they should. I don't mention this function to students, to avoid confusion. If necessary, I would tell them not to use it directly unless they know what they are doing.

Default behaviors are present for many SEM programs. For example, in AMOS, when we draw a diagram, covariances between exogenous variables are not included automatically. Users have to add them. (Missing those covariances seems to be quite common for new learners of AMOS, in my experience.)

It does not mean there is something wrong with lavaan() or its default values. We just can't, and don't have to, make a function work for all possible scenarios (models). In practice, we set defaults for what is common, and then write wrappers when necessary.

I oversimplified the idea (and I have no formal training as a programmer and so I may be wrong on some aspects; please correct me if I made any mistakes about the practice and concept). Wrappers are common in programming, and they sometimes do more than just set the default values. They may preprocess the input objects before calling the main functions and/or postprocess the output objects returned by the main functions:

https://en.wikipedia.org/wiki/Wrapper_function

Many common R functions are wrappers. For example, sapply() is a wrapper of lapply() (and a simple one; type sapply to see the source code), and replicate() is a wrapper of sapply(). The function log10() behaves like a shortcut for log(x, base = 10), although it is implemented as a primitive rather than as an R-level wrapper. The function for ANOVA, aov(), also calls lm(), the function to do regression.
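
As a toy illustration of the idea (not how any of the functions above are actually implemented), a wrapper can be as small as a function that presets one argument of another function, and it can also postprocess the result. The names log_base10() and log_base10_rounded() are hypothetical.

```r
# Hypothetical wrapper: calls log() with the base argument preset to 10
log_base10 <- function(x) {
  log(x, base = 10)
}

# A wrapper can also postprocess: round the result of the function it wraps
log_base10_rounded <- function(x, digits = 2) {
  round(log_base10(x), digits = digits)
}

log_base10(1000)
log_base10_rounded(50)
```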

My personal suggestion: If you are new to lavaan and not sure what to use, simply don't use lavaan() directly, until you really believe you have to.

(Jeremy may disagree, and I am happy to learn about different opinions. :>)

There are cases in which lavaan() can or should be used directly. For example, when refitting a model in exactly the way we want it, when we only have the fitted object (the output of lavaan()) and not the model syntax.

However, those cases are usually in developing things to work with lavaan, not for fitting models in research.

My two cents.

-- Shu Fai

P.S.: A technical note. Despite what the help pages of sem() and cfa() say, internally, they no longer just set the values for some arguments and then call lavaan(), as they did in the past. However, they are still wrappers.

Shu Fai Cheung (張樹輝)

Jun 16, 2025, 7:25:54 PM
to lavaan
On second thought, I might have misunderstood your need. You may be interested in learning exactly how lavaan works. If this is what you want, the best way, but also a hard way, to learn about the model actually fitted is to inspect the parameter table (not the result of parameterEstimates(), which looks similar but is a different thing).

This is an illustration:

``` r
library(lavaan)
#> This is lavaan 0.6-19
#> lavaan is FREE software! Please report any bugs.

set.seed(271828)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rnorm(n)
dat <- data.frame(x1, x2, y)

testmodel1 <- '

  y ~ x1 + x2
  y ~ 1
'
fit1 <- lavaan(testmodel1, data = dat)
#> Warning: lavaan->lav_model_estimate():  
#>    initial model-implied matrix (Sigma) is not positive definite; check your
#>    model and/or starting parameters .
#> Warning: lavaan->lav_model_estimate():  
#>    initial model-implied matrix (Sigma) is not positive definite; check your
#>    model and/or starting parameters .
#> Warning: lavaan->lav_model_estimate():  
#>    initial model-implied matrix (Sigma) is not positive definite; check your
#>    model and/or starting parameters .
#> Warning: lavaan->lav_model_estimate():  
#>    initial model-implied matrix (Sigma) is not positive definite; check your
#>    model and/or starting parameters .
#> Warning: lavaan->lav_lavaan_step11_estoptim():  
#>    Model estimation FAILED! Returning starting values.

testmodel2 <- '

  y ~ x1 + x2
  y ~ 1
  y ~~ y
'
fit2 <- lavaan(testmodel2, data = dat)

parameterTable(fit1)
#>   id lhs op rhs user block group free ustart exo label plabel  start    est se
#> 1  1   y  ~  x1    1     1     1    1     NA   0         .p1.  0.109  0.109 NA
#> 2  2   y  ~  x2    1     1     1    2     NA   0         .p2.  0.128  0.128 NA
#> 3  3   y ~1        1     1     1    3     NA   0         .p3.  0.001  0.001 NA
#> 4  4   y ~~   y    0     1     1    0      0   0         .p4.  0.000  0.000  0
#> 5  5  x1 ~~  x1    0     1     1    0     NA   1         .p5.  0.849  0.849  0
#> 6  6  x1 ~~  x2    0     1     1    0     NA   1         .p6. -0.042 -0.042  0
#> 7  7  x2 ~~  x2    0     1     1    0     NA   1         .p7.  0.896  0.896  0
#> 8  8  x1 ~1        0     1     1    0     NA   1         .p8. -0.017 -0.017  0
#> 9  9  x2 ~1        0     1     1    0     NA   1         .p9. -0.104 -0.104  0
parameterTable(fit2)
#>   id lhs op rhs user block group free ustart exo label plabel  start    est
#> 1  1   y  ~  x1    1     1     1    1     NA   0         .p1.  0.109  0.109
#> 2  2   y  ~  x2    1     1     1    2     NA   0         .p2.  0.128  0.128
#> 3  3   y ~1        1     1     1    3     NA   0         .p3.  0.001  0.001
#> 4  4   y ~~   y    1     1     1    4     NA   0         .p4.  0.924  0.924
#> 5  5  x1 ~~  x1    0     1     1    0     NA   1         .p5.  0.849  0.849
#> 6  6  x1 ~~  x2    0     1     1    0     NA   1         .p6. -0.042 -0.042
#> 7  7  x2 ~~  x2    0     1     1    0     NA   1         .p7.  0.896  0.896
#> 8  8  x1 ~1        0     1     1    0     NA   1         .p8. -0.017 -0.017
#> 9  9  x2 ~1        0     1     1    0     NA   1         .p9. -0.104 -0.104
#>      se
#> 1 0.104
#> 2 0.102
#> 3 0.097
#> 4 0.131
#> 5 0.000
#> 6 0.000
#> 7 0.000
#> 8 0.000
#> 9 0.000
```

<sup>Created on 2025-06-17 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>


To my understanding, internally, the "model" is defined by the parameter table, shown above. What lavaan() and sem() do is simply "parsing" the model syntax and converting it to a parameter table, taking into account the values of other arguments. Once converted to a parameter table, the model syntax is no longer used in model fitting. 
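
One way to watch this conversion happen without fitting anything is lavaanify(), a lavaan function that turns model syntax into a parameter table. The following is a minimal sketch, assuming only that lavaan is installed; it contrasts the tables produced with and without auto.var = TRUE for the regression model discussed in this thread.

```r
library(lavaan)

testmodel <- '
  y ~ x1 + x2
  y ~ 1
'

# Parse the syntax only; no data and no estimation are involved
pt_plain <- lavaanify(testmodel)                   # auto.var is FALSE by default
pt_auto  <- lavaanify(testmodel, auto.var = TRUE)  # free the residual variance

# Look at the row for the (residual) variance of y in each table
row_plain <- subset(pt_plain, lhs == "y" & op == "~~" & rhs == "y")
row_auto  <- subset(pt_auto,  lhs == "y" & op == "~~" & rhs == "y")
row_plain[, c("lhs", "op", "rhs", "user", "free", "ustart")]
row_auto[,  c("lhs", "op", "rhs", "user", "free", "ustart")]
```

A value of 0 in the free column means the parameter is fixed rather than estimated, which corresponds to row 4 of parameterTable(fit1) above.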

Unfortunately, I am not aware of any official documentation on the columns and values in the parameter table. You may need to guess what they are, or do some experiments.

The conversion from the model syntax to the parameter table is not done by lavaan() nor by sem() (and cfa()), but by other functions. Therefore, if you are interested in learning more about how a model syntax is interpreted, you may want to learn more about this function:


P.S.: This function is for developers, not for users, as far as I know. If your goal is just to fit a model correctly, you do not need this function. Just let sem() (or lavaan()) do the job. Nevertheless, if you are interested in learning more about how lavaan works, learning about this function is useful.

-- Shu Fai


Yves Rosseel

Jun 17, 2025, 8:09:14 AM
to lav...@googlegroups.com
The whole idea of lavaan() is that it does not do anything
automagically. Which may or may not be what you like/prefer/want.

The rationale behind this is explained in the lavaan paper:

https://www.jstatsoft.org/article/view/v048i02

Personally, I also tend to use the lavaan() function, typically in
combination with auto.var = TRUE.
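
For the regression model discussed in this thread, that combination could look like the sketch below. It assumes simulated data like that in the illustration earlier in the thread, and the object names are only for illustration.

```r
library(lavaan)

# Simulated data, as in the illustration earlier in the thread
set.seed(271828)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))

model_short <- '
  y ~ x1 + x2
  y ~ 1
'
model_full <- '
  y ~ x1 + x2
  y ~ 1
  y ~~ y
'

# Option 1: let lavaan() add the residual variance via auto.var
fit_auto <- lavaan(model_short, data = dat, auto.var = TRUE)

# Option 2: spell out the residual variance in the model syntax
fit_full <- lavaan(model_full, data = dat)

# The two sets of estimates should agree
coef(fit_auto)
coef(fit_full)
```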

But for beginners, the sem() function is much safer!

Yves.

--
Yves Rosseel
Department of Data Analysis, Ghent University

Pierce Ekstrom

Jun 24, 2025, 11:27:40 AM
to lav...@googlegroups.com
I just realized I neglected to respond to these last comments.

Thanks, everyone, for this useful context!
