Error generating factor scores for new data

Gilles Dutilh

Jun 17, 2020, 1:32:36 PM

to lavaan

When I use lavPredict (lavaan version 0.6.3) with a new data set, I get the error message:

Error in EETAx[[g]][i, , drop = FALSE] : subscript out of bounds

This error message was already discussed in this thread (https://groups.google.com/d/msg/lavaan/8bhErhigpfU/fka2KfxeBQAJ). I have the same traceback as described in that topic:

lav_predict_eta_ebm_ml(lavobject = lavobject, lavmodel = lavmodel, 
       lavdata = lavdata, lavsamplestats = lavsamplestats, se = se, 
       level = level, data.obs = data.obs, eXo = eXo, ML = FALSE, 
       optim.method = optim.method)

However, I believe the problem has a different origin in my case. If I understand correctly, in the other topic the cause seemed to be missing values in the data set on which the model was fit.

First, I got this message when generating factor scores using only non-missing data.

Second, I noticed something very peculiar: the error always occurs when I try to calculate more than 2415 scores. It doesn't matter which 2416 (or more) rows of my data set I take; it only works with at most 2415 of them.

Christian Tillich-Walker

Jun 17, 2021, 5:30:26 PM

to lavaan

I'm also encountering this issue. In my case I can calculate more than 2,415 scores, but I cannot calculate more scores than there are rows in my original training set. I don't know much about the inner workings of lavaan, but I used `options(error = recover)` to explore, and it seems that whatever EETAx is supposed to be, its length is set to the exact size of my original training data.

I encountered this issue because I fit a lavaan model on some survey respondents, and now I want to take that model and score a new set of respondents and extract the factor scores. The data is a mix of numeric and ordinal variables. Because of issues like this...

https://groups.google.com/g/lavaan/c/OzQbVMhe2Kk

...I cannot just score a new data set unless each level from all categorical variables is represented in that set. So I appended the data I want to score to the original data set. But this bug appears to keep us from scoring a set that is longer than the original data set.

I guess I could try generating totally fake data such that each factor level is represented, then appending the new data. Or, each time I want to score new records, I could replace as many rows of the old data as there are new records and hope that all levels of the ordinal variables are still represented. These are not obvious workarounds, though, so if there's a simple fix or a workaround I'm not seeing, please let me know.
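For the approaches described above, it may help to check first whether a batch of new records covers every category seen in training. The following is a minimal base-R sketch of such a check; `missing_levels()` and the toy data frames are hypothetical names made up for illustration and are not part of lavaan.

```r
# Hypothetical helper (not part of lavaan): for each ordered variable, list the
# categories that occur in the training data but not in the data to be scored.
missing_levels <- function(train, newdata, ordered_vars) {
  out <- lapply(ordered_vars, function(v) {
    setdiff(unique(train[[v]]), unique(newdata[[v]]))
  })
  names(out) <- ordered_vars
  # Keep only the variables that lack at least one category
  out[lengths(out) > 0]
}

# Toy data, made up for illustration
train   <- data.frame(q1 = c(1, 2, 3, 3), q2 = c(1, 1, 2, 2))
newdata <- data.frame(q1 = c(1, 3),       q2 = c(2, 1))

gaps <- missing_levels(train, newdata, c("q1", "q2"))
print(gaps)
```

An empty result means every category from the training data also appears in the new data; otherwise the list names the variables and categories that are missing.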

Yves Rosseel

Jun 18, 2021, 4:02:27 AM

to lav...@googlegroups.com

Hello Christian,

Would you be able to create a minimal reprex and email it to me? Thanks.

Yves.

Christian Tillich-Walker

Jun 18, 2021, 4:04:20 PM

to lavaan

I can't seem to post it as an attachment. Let's try this:

```
library(magrittr)
library(dplyr)
library(lavaan)

#' Okay, so, let's start with a basic 1-factor SEM/CFA off the iris data set. We're
#' going to set up a categorical variable but treat it as numeric to start. What
#' I'm expecting (and please correct me if this expectation is incorrect) is that
#' lavPredict could score the original training set of any SEM model and reproduce
#' the latent scores as if the entire training set was scored. And I start by
#' showing that this is true for the all-numeric case.
df <- iris %>%
  mutate(species_fac = as.numeric(factor(Species))) %>%
  select(-Species)

mdl <- sem(
  'factor =~ Petal.Length + Petal.Width + species_fac'
  , data = df
)
score_from_training <- lavPredict(mdl)[150, ]

# The goal is to score a new record. We can't do that directly: lavaan throws
# an error here because, even as numeric, species_fac has no variance in a
# single row. try() lets the rest of the script run after the error.
try(lavPredict(mdl, newdata = df[150, ]))

# One suggestion was to try appending new records to the original training set.
# This produces the EETAx error. For lavPredict, newdata cannot be larger than
# the original training set.
try(lavPredict(mdl, newdata = bind_rows(df, df[150, ])))

# Okay, so that's fine, I want to score just a single row. Let's artificially
# inject some variance by creating a synthetic row. We'll score both but keep
# only the actual data.
synth <- data.frame(
  Sepal.Length = 1
  , Sepal.Width = 1
  , Petal.Length = 1
  , Petal.Width = 1
  , species_fac = 1
) %>% bind_rows(df[150, ])

possible_new_record <- lavPredict(mdl, newdata = synth)[2, ]

# In the numeric case, this works. I get an exact match.
score_from_training == possible_new_record

# But this is not the case for the model with ordered observables.
mdl <- sem(
  'factor =~ Petal.Length + Petal.Width + species_fac'
  , data = df
  , ordered = 'species_fac'
)

score_from_training <- lavPredict(mdl)[150, ]
possible_new_record <- lavPredict(mdl, newdata = synth)[2, ]

# Very different scores
score_from_training == possible_new_record
```

Yves Rosseel

Jun 19, 2021, 5:59:59 AM

to lav...@googlegroups.com

Hello Christian,

Can you try this again with the current version of lavaan (0.6-8)? This
version should work when newdata= contains a single observation, at
least in the numerical case. For the categorical case, this does not
work (yet).

Yves.
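As a side note for readers checking which version they are running, here is a small base-R sketch. `has_min_version()` is a hypothetical helper name; `package_version()` and `utils::packageVersion()` are standard R functions, and "0.6-8" is the version mentioned in this thread.

```r
# Hypothetical helper: TRUE if the version string `installed` is at least
# `required`. The default "0.6-8" is the version mentioned in this thread.
has_min_version <- function(installed, required = "0.6-8") {
  package_version(installed) >= package_version(required)
}

old_ok <- has_min_version("0.6-3")  # the version from the original post
new_ok <- has_min_version("0.6.8")  # R reads "0.6.8" and "0.6-8" as the same version
print(c(old = old_ok, new = new_ok))

# With lavaan installed, check the version that is actually available:
# has_min_version(as.character(utils::packageVersion("lavaan")))
```

Running the commented line in each computing environment shows quickly whether an older lavaan is still being picked up there.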

Christian Tillich-Walker

Jun 19, 2021, 11:28:50 AM

to lavaan

Sorry about that. I thought I had the most recent version, but that may be in a different computing environment, now that I think about it. I'm on 0.6-8 now and I can do single-row numeric scoring. For ordered scoring, a single row doesn't work, but I can append it to the original data and rescore, and that appears to give the right results. I'll use that as a workaround for now.

Thanks again Yves.
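For reference, the append-and-rescore workaround can be sketched roughly as follows. This is only an illustration under assumptions: `score_by_appending()` and `lm_score()` are hypothetical names, and the runnable demo uses `lm()` as a stand-in scorer so that it does not depend on lavaan. The commented lines show how the same helper might be called with `lavPredict()` and the `mdl` and `df` objects from the reprex earlier in the thread.

```r
# Hypothetical helper: score `new_rows` by appending them to the training data,
# scoring the combined data, and keeping only the rows that belong to `new_rows`.
# `score_fun` is any function(fit, newdata) that returns a matrix with one row
# per input row.
score_by_appending <- function(fit, train, new_rows, score_fun) {
  combined <- rbind(train, new_rows[names(train)])
  scores <- score_fun(fit, combined)
  stopifnot(nrow(scores) == nrow(combined))
  scores[nrow(train) + seq_len(nrow(new_rows)), , drop = FALSE]
}

# Stand-in demo without lavaan: an lm() fit plays the role of the fitted model.
demo_fit <- lm(Petal.Width ~ Petal.Length, data = iris)
lm_score <- function(fit, newdata) matrix(predict(fit, newdata), ncol = 1)
demo_scores <- score_by_appending(demo_fit, iris, iris[150, ], lm_score)
print(demo_scores)

# With lavaan (assumes `mdl` and `df` from the reprex above):
# lav_score <- function(fit, newdata) lavaan::lavPredict(fit, newdata = newdata)
# score_by_appending(mdl, df, df[150, ], lav_score)
```

Because the training rows are always part of the scored data, every category of the ordered variables stays represented, which is the reason for appending in the first place.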
