Factor score prediction: How to link the factor scores correctly back to the cases in your data set?


Jessica Fritz

May 10, 2016, 7:21:23 AM
to lavaan
Hi everyone,

I am performing a hierarchical CFA in lavaan and I predicted factor scores for one of the factors, which I would like to use in another model. Therefore, it is highly important that I link the factor scores back to the correct 'participant-IDs'.

I utilized the following code to predict the factor scores.

FactorScores <- predict(fit_Model)

However, lavaan calculates 1218 factor scores and I have 1403 participants in my data set. To make sure that I link the predicted factor scores to the right participants, I utilized the lavInspect() function to retrieve the cases for which lavaan has calculated factor scores. However, I am a bit concerned that this approach might not be appropriate.

cases <- lavInspect(fit_Model, what = "case.idx", add.labels = TRUE, add.class = TRUE, drop.list.single.group = TRUE)

I am fairly new to lavaan, so I was wondering whether someone who is more experienced might have a better idea or a better approach to link the predicted factor scores back to the cases in the data set (considering that lavaan has not predicted factor scores for all participants)?

I would highly appreciate any ideas and advice!

Thank you very much in advance.

Best wishes,
Jes

Terrence Jorgensen

May 11, 2016, 4:51:37 AM
to lavaan
it is highly important that I link the factor scores back to the correct 'participant-IDs'.  I utilized the following code to predict the factor scores.

FactorScores <- predict(fit_Model)

However, lavaan calculates 1218 factor scores and I have 1403 participants in my data set.

If you are using FIML for missing data, lavaan will return NAs for the appropriate rows, but you would still have fewer factor scores than rows if there are any rows that are missing all variables in the model.  If that is the case, it would be easier to create a copy of the data set without those rows:

modelVarNames <- fit_Model@pta$vnames$ov
allAreMissing <- apply(originalData[ , modelVarNames], 1, function(x) all(is.na(x)) )
myData <- originalData[ !allAreMissing , ]

Then (assuming you have a single-group model) use the approach linked here.  If you have a multiple-group model, predict() will return a list of data.frames (one per group) instead of a single data.frame with a column for the grouping variable (as you saw in this post).  So you could sort the data by the grouping variable, then fit your model and save the output of predict(..., newdata), then stack the groups' factor scores and merge them with the data.frame:

myData <- myData[ order(myData$groupVariable) , ]
fit_Model <- lavaan(...)
FactorScores <- do.call(rbind, predict(fit_Model, newdata = myData))
myData <- cbind(myData, FactorScores)
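[Editorial sketch, not the approach the thread links to: for the single-group case, lavInspect()'s "case.idx" can map the scores back to the original rows. fit_Model and myData are the objects from above.]

```r
# rows of the data that lavaan actually used (single-group model)
usedRows <- lavInspect(fit_Model, "case.idx")
fs <- lavPredict(fit_Model)  # factor scores, one row per used case
# attach the scores; rows lavaan dropped keep NA in the new columns
myData[usedRows, colnames(fs)] <- fs
```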


Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Jessica Fritz

May 11, 2016, 6:00:41 AM
to lavaan
Dear Terrence,

Thank you so much for your help!

I actually just saw your reaction a few minutes ago and immediately tried your approach.

I unfortunately get the following error message:
      
allAreMissing <- apply(originalData[ , modelVarNames], 1, function(x) all(is.na(x)))
Error in .subset(x, j) : invalid subscript type 'list'


I do not really know what R is trying to tell me or how to solve it. I tried saving modelVarNames as a vector/matrix/data.frame, which is probably not appropriate, but I always get the same error. Do you have any suggestions?

Thank you so much for your help.

By the way, I fortunately do not have multiple groups. But I think that I cannot use FIML because I need to use WLSMV, and for missing = "fiml" you need an ML estimation approach (as far as I know), is that correct?

Again, thank you for your help!

Jes

Jessica Fritz

May 11, 2016, 6:55:37 AM
to lavaan
I tried to come up with something myself, and I now have the following approach:

modelVarNames <- fit_HCFA_V_T1_cat_fs@pta$vnames$ov; modelVarNames
Data_id <- data.frame('all variables named in modelVarNames' + id) # manually entered all variables that were retrieved by modelVarNames, plus id
CompleteData_id <- na.omit(Data_id)
CompleteData_id$fs <-  predict(fit_Model, CompleteData_id)
CompleteData_id <- rename(CompleteData_id, id = data.id)
Full_data <- merge(data, CompleteData_id, by = "id")

I think this approach should work; however, I then end up with a data set that only contains information for the people who have scores on all model variables. So a lot of information gets lost. That is not optimal, but I cannot think of a way to merge the factor scores correctly into the full data set without deleting any information.
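[Editorial sketch using the object names above: a left join would keep every row of the full data set; unmatched participants simply get NA factor scores.]

```r
# all.x = TRUE keeps every row of the full data set; participants
# without predicted factor scores get NA in the fs column
Full_data <- merge(data, CompleteData_id[ , c("id", "fs")],
                   by = "id", all.x = TRUE)
```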

The approach you suggested, Terrence, was for FIML estimation, right?

I haven't used FIML, but WLSMV, so I eventually utilized na.omit, because I thought that with the WLSMV approach only complete data will be incorporated into the model.

Could anyone confirm that this approach works?

Thank you !

Terrence Jorgensen

May 12, 2016, 3:19:37 AM
to lavaan
I unfortunately get the following error message:
      
allAreMissing <- apply(originalData[ , modelVarNames], 1, function(x) all(is.na(x)))
Error in .subset(x, j) : invalid subscript type 'list'

Whoops, I forgot to extract the first list element so that you get a vector of variable names.  Try this, and the rest should work:

modelVarNames <- fit_Model@pta$vnames$ov[[1]]

By the way, I fortunately do not have multiple groups. But I think that I cannot use FIML because I need to use WLSMV, and for missing = "fiml" you need an ML estimation approach (as far as I know), is that correct?

Correct, you cannot use FIML if you need to use a least-squares estimator.  You can set the argument missing = "pairwise" to use all available information when calculating the polychoric correlations to which the model will be fit.
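[A sketch of what that might look like; the model, data, and item names are placeholders.]

```r
# WLSMV with pairwise-present data: the polychoric correlations use all
# available pairs of observations instead of listwise-deleted complete cases
fit_Model <- cfa(myModel, data = myData,
                 ordered = c("item1", "item2", "item3"),
                 estimator = "WLSMV", missing = "pairwise")
```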

Yves Rosseel

May 12, 2016, 3:42:37 AM
to lav...@googlegroups.com
On 05/12/2016 09:19 AM, Terrence Jorgensen wrote:
> modelVarNames <- fit_Model@pta$vnames$ov[[1]]

I would avoid accessing lavaan's internal slots. Everything in the
pta$vnames slot can be extracted using the lavNames() function. In this
case:

modelVarNames <- lavNames(fit_Model, "ov")

Yves.

Terrence Jorgensen

May 12, 2016, 4:40:18 AM
to lavaan
Everything in the pta$vnames slot can be extracted using the lavNames() function.

Awesome, I was looking for something like that on the lavInspect page.  I'm glad to know about the lavNames() function now.

Terry

Jessica Fritz

May 12, 2016, 8:14:08 AM
to lavaan
Great! Thank you very much Yves and Terry! This was very helpful!

Just to double-check the approach: if I just use the missing argument at its default together with the WLSMV estimator, then lavaan only uses complete cases, right? So this would mean that using na.omit(data) would give me exactly the same data matrix that was used by lavaan to estimate the model, right? (At least this gives me the same number of observations.)

I also have another question, which I think has probably a very logical reason but at this very moment I am a bit confused over my results.

I tried to run a hierarchical CFA (two levels; first level 5 latent factors and second level 1 latent factor). I actually get exactly the same results when running the same code once with orthogonal = TRUE and once with orthogonal = FALSE (also exactly the same number of df).

I am afraid that there is a perfectly reasonable answer but I just cannot come up with it at the moment! What is the correct approach for hierarchical CFA models:
1: letting first-order factors correlate (orthogonal = FALSE) and estimating an overall second-order factor based on the correlations, or
2: NOT calculating correlations between first-order latent factors (orthogonal = TRUE) and estimating an overall second-order factor representing the correlations?

I am sorry for posting a probably very obvious question here.

I hope someone has an answer for me?

Thank you very much!

Best,

Jes

Yves Rosseel

May 12, 2016, 8:22:52 AM
to lav...@googlegroups.com
On 05/12/2016 02:14 PM, Jessica Fritz wrote:
> Just to double check the approach: If I just use the missing argument as
> its default together with the WLSMV estimator then lavaan only uses
> complete cases, right?

Yes.

> So this would mean that using na.omit(data) would
> give me exactly the same data matrix that was used by lavaan to estimate
> the model, right? (at least this gives me the same number of observations)

Yes.

> I also have another question, which I think has probably a very logical
> reason but at this very moment I am a bit confused over my results.
>
> I tried to run a hierarchical CFA (two levels; first level 5 latent
> factors and second level 1 latent factor) I actually get exactly the
> same results when running the same code once with orthogonal = TRUE and
> once with orthogonal = FALSE (also exactly the same amount of df).

In a second-order CFA, the first-order latent variables are 'middle-man' variables. They are neither purely 'endogenous' nor purely 'exogenous', i.e. they are not listed in the output of lavNames(fit, "lv.y") or lavNames(fit, "lv.x"). For these cases, lavaan by default does not add correlations between the residual variances. Therefore, orthogonal = TRUE has no effect, because there are no correlations between factors to fix to zero. They are already zero.

Yves.

Jessica Fritz

May 12, 2016, 8:33:56 AM
to lavaan
Thank you so much Yves! This sounds indeed very logical! Thank you very much!

And is it generally speaking right to have no correlations between these between-layer (endogenous/exogenous) variables, or would you recommend adding correlations between these variables?

I first had no correlations because I assumed the correlations to be incorporated in the estimation of a second-order factor, but now I think that I have come across some examples where correlations have been added at the between layer. I am only not quite sure which is statistically correct for hierarchical/higher-order models.

Thank you very much for your advice and support!

Marcos Angelini

May 12, 2016, 8:38:07 AM
to lavaan
Nice that we have this function now. Yves, would it be possible to specify the order of the variables in lavaan for future use (such as semPlot)? It would be nice to provide a vector with those names somewhere.
 
Thanks,
Marcos
 
 

Yves Rosseel

May 12, 2016, 8:43:24 AM
to lav...@googlegroups.com
On 05/12/2016 02:33 PM, Jessica Fritz wrote:
> Thank you so much Yves! This sounds indeed very logical! Thank you very much!
>
> And is it generally speaking right to have no correlations between these
> between-layer (endogenous/exogenous) variables

Yes. After all, the general factor is supposed to 'explain' the
correlations among the first-order factors.

Yves.

Jessica Fritz

May 12, 2016, 8:45:19 AM
to lavaan
I also have another question, also with regard to the hierarchical CFA I am trying to fit. I am so sorry for all these questions!

I tried to utilize categorical variables and utilized the following sheets as template: http://www.personality-project.org/r/tutorials/summerschool.14/rosseel_sem_cat.pdf
As described in the sheets, I declared 20 items as 'ordered factors' and utilized them as manifest variables. I expected lavaan to calculate polychoric correlations between them, because as far as I know you cannot add a correlation matrix yourself, but you need to let lavaan know that you want to use ordinal ('ordered factors') variables?

However, I now get the following warnings (see below), even after utilizing only the complete cases of my data. Do you maybe have an idea what I am doing wrong?

Data_T3_cat_Full <- na.omit(Data_T3_cat)
HCFA_V_T3_cat <- 'model(I left the syntax out here because it is very long; if it helps I can of course also post it!)'
fit_HCFA_V_T3_cat <- cfa(HCFA_V_T3_cat, data = Data_T3_cat_Full, estimator = "WLSMV")
There were 50 or more warnings (use warnings() to see the first 50)
1: In pc_cor_TS(fit.y1 = FIT[[i]], fit.y2 = FIT[[j]], method = optim.method,  ... :
  lavaan WARNING: empty cell(s) in bivariate table of B_data.t3_friend_i2c x B_data.t3_friend_i1c


Yves Rosseel

May 12, 2016, 8:46:29 AM
to lav...@googlegroups.com
On 05/12/2016 02:38 PM, Marcos Angelini wrote:
> modelVarNames <- lavNames(fit_Model, "ov")
>
> Nice we have this function now.

It has been there for several years. Apparently, I failed to communicate
this.

> Yves, would be possible to specify the
> order of the variables in lavaan for future use (such as semPlot)? It
> would be nice to provide a vector with those names somewhere.

I am not sure what you mean. The order is determined by your syntax. So
if you need a different order, simply list them in a different order in
your syntax.

But perhaps I am missing the point here.

Yves.

Jessica Fritz

May 12, 2016, 8:48:23 AM
to lavaan
Thank you so much! I was very worried about my approach and I am enormously glad about your advice! Many thanks!

Yves Rosseel

May 12, 2016, 8:49:45 AM
to lav...@googlegroups.com
> There were 50 or more warnings (use warnings() to see the first 50)
>
> 1: In pc_cor_TS(fit.y1 = FIT[[i]], fit.y2 = FIT[[j]], method = optim.method, ... :
> lavaan WARNING: empty cell(s) in bivariate table of B_data.t3_friend_i2c x B_data.t3_friend_i1c

You are not doing anything wrong. They are not errors, just warnings. It means that (many) two-way frequency tables contain empty cells.

This may be perfectly harmless, but it never hurts to have a look at
those two-way tables. You can use the lavTables() function to explore this.
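[A sketch of that check, assuming the fitted object from the post above; see ?lavTables for the exact columns returned.]

```r
# two-way (pairwise) frequency tables for the ordinal indicators
tab2 <- lavTables(fit_HCFA_V_T3_cat, dimension = 2L)
# inspect the cells with zero observed frequency that triggered the warnings
subset(tab2, obs.freq == 0)
```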

Yves.

Jessica Fritz

May 12, 2016, 9:05:52 AM
to lavaan
Ok I will do that!

Thank you very very much for your help!

Marcos Angelini

May 12, 2016, 9:26:51 AM
to lavaan
Yes, this is the point. Sometimes it is not possible to give the needed order; for example, suppose I have the following structural model:
Y1 ~ X2 + X3
Y2 ~ X1 + X2 + X4
Then it is not possible to make the sequence of X as X1, X2, X3, X4. In this case I gave numbers (which could be changed), but in my real case the variables have names, and the structure in the graphical model is different from the order that lavaan gives it.
But of course it is only cosmetic, so do not worry too much about it.

Kind regards,
Marcos
 

Yves Rosseel

May 12, 2016, 9:58:15 AM
to lav...@googlegroups.com
On 05/12/2016 03:26 PM, Marcos Angelini wrote:
> Yes, this is the point. Sometimes it is not possible to give the needed
> order, for example:
> suppose I have the following structural model
> Y1 ~ X2 + X3
> Y2 ~ X1 + X2 + X4
> Then, it is not possible to make the sequence of X as X1, X2, X3, X4.

Well, you could write

model <- '
Y2 ~ X1
Y1 ~ X2
Y2 ~ X2
Y1 ~ X3
Y2 ~ X4
'

but then the order would be

> lavNames(lavaanify(model))
[1] "Y2" "Y1" "X1" "X2" "X3" "X4"

I understand you may wish to have "Y1", "Y2", "X1", "X2" and so on... I
will think about it (but no promises).

Yves.

Gloria Ma

Sep 24, 2017, 11:52:25 AM
to lavaan
Hi Terrence,

I was trying to use the approach you recommended to save the factor scores of a multi-group SEM. The code is below, but I got an error. I am not good at R. Could you please help me check the code? What can I do to improve it?

q1.model<-"
q1per=~Openness+Extraversion+Conscientiousness+Agreeableness+Honesty_Humility+Depen_total
S_q1out=~S_Inrole_perform+S_Exrole_perform+S_Voice_behav+S_WE+S_Serenity+S_JS
S_q1out~q1per
#+gender+age+actu_hours+supervision+Social_desirability
S_Inrole_perform ~~   S_Exrole_perform
S_WE ~~    S_JS
Agreeableness ~~  Honesty_Humility
"
q1model.fit<-sem(q1.model, data=mydata,missing='fiml', group = "situation")
summary(q1model.fit, fit.measures = TRUE, standardized = TRUE, ci= TRUE)

factorm<-lavNames(q1model.fit, "ov")
allaremissing<-apply(mydata[ ,factorm],1, function(x) all (is.na(x)))
ndata <-mydata[ !allaremissing , ]
ndata<-ndata[ order(ndata$situation) , ]
q1model.fit<-sem(q1.model, data=ndata,missing='fiml', group = "situation")
factorscores<-do.call(rbind,predict(q1model.fit, newdata=ndata)) 
newdata<-cbind(ndata, factorscores)
write.csv(newdata, "q1factorscores.csv")

This is the error that I got:

> newdata<-cbind(ndata, factorscores)
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 529, 347


Thank you very much for your help!

Best regards,

Gloria
On Wednesday, May 11, 2016 at 10:51:37 AM UTC+2, Terrence Jorgensen wrote:

Terrence Jorgensen

Sep 26, 2017, 9:16:16 AM
to lavaan
Could you please help me check the code? What can I do to improve it?

I don't have access to your data, so I used your code, replacing "mydata" with a copy of HolzingerSwineford1939 from lavaan (with some missing data imposed), giving the same name to the model syntax, and saving the "school" variable as "situation".  That way, I could just copy/paste your code using available data.

I had no problem running your syntax:

mydata <- HolzingerSwineford1939[ , paste0("x", 1:9)]
mydata$situation <- HolzingerSwineford1939$school
set.seed(12345)
mydata$x5 <- ifelse(mydata$x5 <= quantile(mydata$x5, .3), NA, mydata$x5)
age <- HolzingerSwineford1939$ageyr + HolzingerSwineford1939$agemo/12
mydata$x9 <- ifelse(age <= quantile(age, .3), NA, mydata$x9)

## specify CFA model from lavaan's ?cfa help page
q1.model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

q1model.fit <- sem(q1.model, data = mydata, missing = 'fiml', group = "situation")

factorm <- lavNames(q1model.fit, "ov")
allaremissing <- apply(mydata[ , factorm], 1, function(x) all(is.na(x)))
ndata <- mydata[ !allaremissing , ]
ndata <- ndata[ order(ndata$situation) , ]
q1model.fit <- sem(q1.model, data = ndata, missing = 'fiml', group = "situation")

factorscores <- do.call(rbind, predict(q1model.fit, newdata = ndata))
newdata <- cbind(ndata, factorscores)

Since your error message is so helpfully informative, perhaps you could investigate it and notice something you did not notice before.  (Perhaps there is an earlier error message that you didn't notice, which prevents later syntax from running without error.)

dim(ndata)
dim(factorscores)

head(ndata)
head(factorscores)

Gloria Ma

Sep 26, 2017, 10:27:10 AM
to lav...@googlegroups.com
Hi Terrence,

Thank you very much for your reply! I just figured out the problem, and actually it's a little stupid. The cbind() command requires the same number of rows in the two data sets. Because of the missing data in my group variable, I ended up with a new data set with a different number of rows. After I subset the data and made sure the group variable had no missing data anymore, the code worked.

So in my dataset, use  

ndata <- subset(ndata, situation >= 1) instead of ndata <- ndata[ order(ndata$situation) , ]
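[A caution worth noting here, as an editorial aside: do.call(rbind, predict(...)) stacks the factor scores group by group, so the rows of ndata still need to be sorted by the grouping variable before cbind(). It may be safest to keep both steps rather than replace one with the other:]

```r
ndata <- subset(ndata, situation >= 1)      # drop rows with a missing group
ndata <- ndata[ order(ndata$situation) , ]  # keep the groupwise sort
```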

Best regards,

Gloria

