Puzzling report of high correlation between two binary variables


Hugo Harada

Mar 31, 2022, 10:36:07 AM
to lav...@googlegroups.com

Hello all,

I have a dataset that, once fed into a model, produces the following messages, which indicate that the variables i98.t2 and i90.t2 are highly correlated. Both variables are binary.

Error in nlminb(start = start.x, objective = objective_function, gradient = GRADIENT,  :
  NA/NaN gradient evaluation
In addition: Warning messages:
1: In lav_samplestats_step2(UNI = FIT, wt = wt, ov.names = ov.names,  :
  lavaan WARNING: correlation between variables i98.t2 and i90.t2 is (nearly) 1.0
2: In muthen1984(Data = X[[g]], wt = WT[[g]], ov.names = ov.names[[g]],  :
  lavaan WARNING: trouble constructing W matrix; used generalized inverse for A22 submatrix

If I remove one of the variables, the estimation runs without a problem.

What puzzles me is that when I calculate the correlation between these two variables myself, I get something close to 0.30. Am I missing something?

> psych::phi(table(data.frame(dat[, c("i90.t2","i98.t2")])))
[1] 0.34
> polycor::hetcor(dat[, c("i90.t2","i98.t2")])

Two-Step Estimates

Correlations/Type of Correlation:
       i90.t2  i98.t2
i90.t2      1 Pearson
i98.t2 0.3386       1

Looking forward to hearing from you,

Hugo


Pat Malone

Mar 31, 2022, 11:45:04 AM
to lav...@googlegroups.com
Hugo,

I would guess the two variables are highly unbalanced? Also, are they declared as ordered, either in the dataframe or in the lavaan call? 

If they are considered ordered in lavaan, then the extremely high correlation is between the underlying continuous latents (in a probit sense), not the observed binaries.

Can you show the 2x2 cross-tab? Also, showing the lavaan call can often help.

Pat


--
Patrick S. Malone, PhD
Sr Research Statistician, FAR HARBΦR

Hugo Harada

Mar 31, 2022, 12:52:23 PM
to lav...@googlegroups.com
Hi Patrick,

Thank you for your attention.

Here is the cross-tab. 
> table(data.frame(dat[, c("i90.t2","i98.t2")]))
      i98.t2
i90.t2   0   1
     0 323 479
     1   1 197

"I would guess the two variables are highly unbalanced?"

Not sure I understand what worries you here.

And here is the call.

    # Logical vector selecting the columns of dat passed to the model (here: all)
    select_vec <- rep(TRUE, ncol(dat))
    names(select_vec) <- colnames(dat)

    mod.fit <- lavaan(lavaan.model.t123.free.CommonItems,
                      data = dat[, select_vec],
                      int.ov.free = TRUE,
                      int.lv.free = FALSE,
                      meanstructure = TRUE,
                      std.lv = FALSE,
                      auto.fix.first = FALSE,
                      auto.var = TRUE,
                      auto.th = TRUE,
                      auto.delta = TRUE,
                      auto.cov.y = TRUE,
                      # every selected column is declared ordered (categorical)
                      ordered = colnames(dat)[select_vec],
                      parameterization = "theta")

Kind regards,

Hugo

Pat Malone

Mar 31, 2022, 1:08:53 PM
to lav...@googlegroups.com
You didn't say if these variables are declared ordered, but the nearly empty cell is almost certainly your problem. When lavaan tries to estimate the correlation between the underlying variables, part of that estimate rests on very little information: a single observation in the (i90.t2 = 1, i98.t2 = 0) cell.

These are for practical purposes a 3-category variable: going by the cross-tab, observations are "neither," "i98 only," or "both," while "i90 only" is almost never observed.
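To make the difference between the two correlations concrete, here is a minimal base-R sketch that estimates the correlation between the underlying normal variables (the tetrachoric correlation) from the cross-tab posted above and compares it with the phi (Pearson) coefficient. It assumes a simple two-step approach: thresholds are taken from the marginal proportions, then the correlation is chosen to maximise the likelihood of the four cells. The helper names `biv_cdf` and `loglik` are hypothetical, and the result is not guaranteed to match what lavaan computes internally.

```r
# 2x2 table as posted in the thread (rows: i90.t2, columns: i98.t2)
tab <- matrix(c(323, 479,
                  1, 197),
              nrow = 2, byrow = TRUE,
              dimnames = list(i90.t2 = c("0", "1"), i98.t2 = c("0", "1")))
n <- sum(tab)

# Phi coefficient: the Pearson correlation of the observed 0/1 values
phi_hat <- (tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]) /
  sqrt(prod(rowSums(tab), colSums(tab)))

# Step 1: thresholds from the marginal proportions of zeros
tau90 <- qnorm(sum(tab["0", ]) / n)
tau98 <- qnorm(sum(tab[, "0"]) / n)

# Bivariate standard normal CDF P(X < a, Y < b) with correlation rho, using
# F(a, b; rho) = pnorm(a) * pnorm(b) + integral over r in [0, rho] of the
# bivariate normal density at (a, b); substituting r = rho * u keeps the
# integration range at [0, 1] for either sign of rho.
biv_cdf <- function(a, b, rho) {
  dens <- function(u) {
    r <- rho * u
    rho * exp(-(a^2 - 2 * r * a * b + b^2) / (2 * (1 - r^2))) /
      (2 * pi * sqrt(1 - r^2))
  }
  pnorm(a) * pnorm(b) + integrate(dens, 0, 1, rel.tol = 1e-8)$value
}

# Step 2: multinomial log-likelihood of the four cells, thresholds held fixed
loglik <- function(rho) {
  p00 <- biv_cdf(tau90, tau98, rho)
  p <- c(p00,                                    # i90 = 0, i98 = 0
         pnorm(tau90) - p00,                     # i90 = 0, i98 = 1
         pnorm(tau98) - p00,                     # i90 = 1, i98 = 0
         1 - pnorm(tau90) - pnorm(tau98) + p00)  # i90 = 1, i98 = 1
  counts <- c(tab[1, 1], tab[1, 2], tab[2, 1], tab[2, 2])
  sum(counts * log(pmax(p, 1e-12)))  # guard against log(0)
}

rho_hat <- optimize(loglik, interval = c(-0.99, 0.99), maximum = TRUE)$maximum

round(c(phi = phi_hat, tetrachoric = rho_hat), 3)
```

The latent correlation comes out noticeably larger than the phi coefficient because the model has to explain why the (i90.t2 = 1, i98.t2 = 0) cell holds only one observation, and moving that single observation changes the estimate considerably.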

Pat

Hugo Harada

Mar 31, 2022, 2:24:35 PM
to lav...@googlegroups.com
Patrick,

Yes, they are ordered. 

                      ordered = colnames(dat)[select_vec],
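
Since the columns are declared `ordered` in the lavaan call but were passed to `hetcor` as plain numeric 0/1 columns, the two results earlier in the thread are not like-for-like, which is why the `hetcor` output is labelled Pearson. The sketch below rebuilds a data frame from the posted cross-tab (the original `dat` is not available here) and prepares both representations. The names `dat_num` and `dat_ord` are hypothetical, and the `polycor` call is guarded because the package may not be installed.

```r
# Rebuild one row per respondent from the cross-tab posted in the thread
tab <- as.table(matrix(c(323, 479,
                           1, 197),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(i90.t2 = c("0", "1"),
                                       i98.t2 = c("0", "1"))))
cells <- as.data.frame(tab, stringsAsFactors = FALSE)
rows  <- rep(seq_len(nrow(cells)), times = cells$Freq)

# Numeric 0/1 columns: the representation hetcor treated as continuous
dat_num <- data.frame(i90.t2 = as.integer(cells$i90.t2[rows]),
                      i98.t2 = as.integer(cells$i98.t2[rows]))
r_pearson <- cor(dat_num$i90.t2, dat_num$i98.t2)

# Ordered-factor copies: the representation that `ordered =` requests in lavaan
dat_ord <- data.frame(lapply(dat_num, ordered))

if (requireNamespace("polycor", quietly = TRUE)) {
  het <- polycor::hetcor(dat_ord, std.err = FALSE)
  print(het$type)          # should now label the pair as polychoric
  print(het$correlations)  # latent correlation rather than the Pearson value
}
```

With ordered factors, `hetcor` estimates a polychoric (here tetrachoric) correlation, which is the quantity to compare with lavaan's warning.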

"These are for practical purposes a 3-category variable: going by the cross-tab, observations are "neither," "i98 only," or "both," while "i90 only" is almost never observed."

Makes total sense! Thank you.

Hugo

Hugo Harada

Apr 4, 2022, 2:11:09 PM
to lav...@googlegroups.com, p...@farharbor.com
Patrick,

I had to review my math to understand your point, and you nailed it. For completeness, I am attaching some pages from Lord & Novick. See the marks in red.

Thank you. 

Hugo
Attachment: TetrachoricCorrelations.pdf