Error generating factor scores for new data

Gilles Dutilh

Jun 17, 2020, 1:32:36 PM
to lavaan
When I use lavPredict (lavaan version 0.6-3) on a new data set, I get the following error message:

Error in EETAx[[g]][i, , drop = FALSE] : subscript out of bounds


This error message was already discussed in this thread (https://groups.google.com/d/msg/lavaan/8bhErhigpfU/fka2KfxeBQAJ). I have the same traceback as described in that topic:
lav_predict_eta_ebm_ml(lavobject = lavobject, lavmodel = lavmodel,
       lavdata = lavdata, lavsamplestats = lavsamplestats, se = se,
       level = level, data.obs = data.obs, eXo = eXo, ML = FALSE,
       optim.method = optim.method)


However, I believe the problem has a different origin in my case. If I understand correctly, in the other topic the cause seemed to be missing values in the data set on which the model was fit.

First, I get this error even when generating factor scores from rows without any missing values.
Second, I noticed something very peculiar: it always happens when I try to calculate more than 2415 scores. It doesn't matter which 2416 (or more) rows of my data set I take; it only works with 2415 rows or fewer.


Christian Tillich-Walker

Jun 17, 2021, 5:30:26 PM
to lavaan
I'm also encountering this issue. In my case I can calculate more than 2,415 scores, but I cannot calculate more scores than there are rows in my original training set. I don't know much about the inner workings of lavaan, but I used `options(error = recover)` to explore, and it seems that whatever EETAx is supposed to be, its length is set to the exact size of my original training data.

I encountered this issue because I fit a lavaan model on some survey respondents, and now I want to take that model and score a new set of respondents and extract the factor scores. The data is a mix of numeric and ordinal variables. Because of issues like this...


...I cannot just score a new data set unless each level from all categorical variables is represented in that set. So I appended the data I want to score to the original data set. But this bug appears to keep us from scoring a set that is longer than the original data set.

I guess I could try generating totally fake data such that each factor level is represented, then appending the new data. Or I could try replacing just as many entries from the old data as there is new data each time I want to score new records and hope that all levels of ordinals are still represented. These are not obvious workarounds, though, so if there's a simple fix or a workaround I'm not seeing please let me know.
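
Before appending or replacing rows, it may help to see which categories of the ordered variables are missing from the records to be scored. The following is a minimal base-R sketch of such a check, not part of lavaan; `missing_levels`, `train`, `new`, and `ordered_vars` are hypothetical names chosen for illustration.

```r
# For each ordered variable, list the categories observed in the training
# data that do not occur in the new data. All names here are illustrative.
missing_levels <- function(train, new, ordered_vars) {
  seen <- function(x) unique(x[!is.na(x)])  # observed categories, NA ignored
  gaps <- lapply(ordered_vars, function(v) {
    setdiff(seen(train[[v]]), seen(new[[v]]))
  })
  names(gaps) <- ordered_vars
  # Keep only the variables that actually have unrepresented categories
  gaps[lengths(gaps) > 0]
}

# Toy data: category 2 of item1 never occurs in the new records
train <- data.frame(item1 = c(1, 2, 3, 2), item2 = c(1, 1, 2, 2))
new   <- data.frame(item1 = c(1, 3),       item2 = c(2, 1))
missing_levels(train, new, c("item1", "item2"))
```

An empty result would suggest that every category is represented; a non-empty one names the variables for which extra rows would have to be supplied.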

Yves Rosseel

Jun 18, 2021, 4:02:27 AM
to lav...@googlegroups.com
Hello Christian,

Would you be able to create a minimal reprex and email it to me? Thanks.

Yves.

Christian Tillich-Walker

Jun 18, 2021, 4:04:20 PM
to lavaan
I can't seem to post it as an attachment. Let's try this:

```
library(magrittr)
library(dplyr)
library(lavaan)

#' Okay, so, let's start with a basic 1-factor SEM/CFA on the iris data set. We're
#' going to set up a categorical variable but treat it as numeric to start. What
#' I'm expecting (and please correct me if this expectation is incorrect) is that
#' lavPredict can score rows from the original training set of any SEM model and
#' reproduce the latent scores those rows get when the entire training set is
#' scored. I start by showing that this is true for the all-numeric case.
df <- iris %>%
  mutate(species_fac = as.numeric(factor(Species))) %>%
  select(-Species)

mdl <- sem(
  'factor =~ Petal.Length + Petal.Width + species_fac'
  , data = df
)
score_from_training <- lavPredict(mdl)[150,]

# The goal is to score a new record. We can't do that directly: lavaan throws
# an error here, even with species_fac treated as numeric, because a single row
# has no variance in species_fac. try() lets the rest of the script keep running.
try(lavPredict(mdl, newdata = df[150, ]))

# One suggestion was to try appending new records to the original training set.
# This produces the EETAx error: for lavPredict, newdata apparently cannot be
# larger than the original training set.
try(lavPredict(mdl, newdata = bind_rows(df, df[150, ])))

# Okay, so that's fine; I want to score just a single row. Let's artificially
# inject some variance by creating a synthetic row. We'll score both rows but
# keep only the score for the actual data.
synth <- data.frame(
  Sepal.Length = 1
  , Sepal.Width = 1
  , Petal.Length = 1
  , Petal.Width = 1
  , species_fac = 1
) %>% bind_rows(df[150, ])
possible_new_record <- lavPredict(mdl, newdata = synth)[2, ]

# In the numeric case, this works. I get an exact match.
score_from_training == possible_new_record


# But this is not the case for the model with ordered observables.
mdl <- sem(
  'factor =~ Petal.Length + Petal.Width + species_fac'
  , data = df
  , ordered = 'species_fac'
)

score_from_training <- lavPredict(mdl)[150, ]
possible_new_record <- lavPredict(mdl, newdata = synth)[2, ]

# Very different scores
score_from_training == possible_new_record
```
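
The comparisons in the reprex use `==`, which tests exact floating-point equality. When two scores come from different code paths, a tolerance-based comparison is usually more robust. A minimal base-R sketch follows; the helper name `scores_match` and the default tolerance are assumptions made for illustration.

```r
# Compare two score vectors up to a numerical tolerance instead of exactly.
# unname() ignores differing names such as the factor label on each score.
scores_match <- function(a, b, tol = 1e-6) {
  isTRUE(all.equal(unname(a), unname(b), tolerance = tol))
}

scores_match(c(factor = 1.2345678), 1.2345679)  # tiny numerical difference
scores_match(1.2, 3.4)                          # genuinely different scores
```

In the reprex this could stand in for the `==` lines, for example `scores_match(score_from_training, possible_new_record)`.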







Yves Rosseel

Jun 19, 2021, 5:59:59 AM
to lav...@googlegroups.com
Hello Christian,

Can you try this again with the current version of lavaan (0.6-8)? This
version should work when newdata= contains a single observation, at
least in the numerical case. For the categorical case, this does not
work (yet).

Yves.



Christian Tillich-Walker

Jun 19, 2021, 11:28:50 AM
to lavaan
Sorry about that. I thought I had the most recent version, but that may be in a different computing environment, now that I think about it. I'm on 0.6-8 now and single-row numeric scoring works. For ordered scoring, a single row doesn't work, but I can append the new rows to the original data and rescore, and that appears to give the right results. I'll use that as a workaround for now.

Thanks again Yves.
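
The append-and-rescore workaround described in this post could be wrapped up roughly as follows. This is a sketch under stated assumptions, not a lavaan feature: it assumes a single-group model (so that `lavPredict` returns a matrix) and that no rows are dropped during scoring. The helper name `score_appended` and the `predict_fun` hook are hypothetical; the hook exists only so the helper can be tried without a fitted model, and it defaults to `lavaan::lavPredict`.

```r
# Score `new` by appending it to the training data and keeping only the
# trailing rows of the result. Assumes a single-group model whose scores
# come back as a matrix with one row per input row.
score_appended <- function(fit, train, new,
                           predict_fun = lavaan::lavPredict) {
  # Use the training columns only, in the training order
  combined <- rbind(train, new[, names(train), drop = FALSE])
  scores <- predict_fun(fit, newdata = combined)
  if (!is.matrix(scores) || nrow(scores) != nrow(combined)) {
    stop("scores do not line up with the input rows")
  }
  # The new records are the last nrow(new) rows of the combined data
  scores[nrow(train) + seq_len(nrow(new)), , drop = FALSE]
}
```

With the objects from the reprex above, a call would look like `score_appended(mdl, df, df[150, ])`. Whether the combined data can be scored at all depends on the lavaan version, as discussed in this thread.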