Very low Neff when performing GWAS-by-subtraction


Trine Nielsen

Jul 13, 2023, 5:46:05 AM
to Genomic SEM Users
Hi GenomicSEM team,

I have used GenomicSEM to perform a GWAS-by-subtraction, and I would now like to use the GWAS summary statistics for the two latent factors in downstream analyses that require a sample size. However, I get very low estimates of Neff when I try to calculate them.

I used the tutorial found here: https://rpubs.com/MichelNivard/565885.
I have used the model: 

model<-'F1=~NA*Pheno1 + NA*Pheno2
         F2=~NA*Pheno1
         
         F1~SNP
         F2~SNP

         F2~~1*F2
         F1~~1*F1
         F1~~0*F2

         Pheno2 ~~ 0*Pheno1
         Pheno2~~0*Pheno2
         Pheno1~~0*Pheno1
         SNP~~SNP'

After the GWAS was run, I used the following code to calculate Neff for the two latent factors (code from https://github.com/PerlineDemange/non-cognitive/blob/master/GenomicSEM/Cholesky%20model/Calculation_samplesize.R):


# Per-SNP expected sample size; 'loading' is a rescaling term
# defined earlier in Demange et al.'s script
F1$Neff <- ((F1$Z_Estimate/(F1$est*sqrt(loading)))^2)/(2*F1$MAF*(1-F1$MAF))
summary(F1$Neff)

# Restrict to common SNPs (MAF between .10 and .40) and report
# the floored mean Neff as the sample size
F1_maf_thresh <- F1[F1$MAF >= .10 & F1$MAF <= .40,]
F1_maf_thresh$N <- floor(mean(F1_maf_thresh$Neff))
summary(F1_maf_thresh$N)

For both factors, I get Neff < 13,000, which is far below the original sample sizes for my two phenotypes, Pheno1 and Pheno2.

Can you think of anything I am doing wrong, or could do differently? Any help would be greatly appreciated.

Best regards, Trine


agro...@gmail.com

Jul 13, 2023, 1:00:37 PM
to Genomic SEM Users
Hi Trine, 

The sample size for factors in Genomic SEM can be really tricky. It looks like you are specifying the GWAS-by-subtraction correctly and the issue likely relates to the scaling of the factors in this model. At the bottom of this wiki page (https://github.com/GenomicSEM/GenomicSEM/wiki/4.-Common-Factor-GWAS) we write the following about backing out the expected sample size for summary statistics of factors in Genomic SEM: 
"If we were to scale the latent genetic factors with unit variance identification, N_hat would be interpreted relative to a factor that is 100% heritable, and N_hat would be unintuitively very small (because, all else being equal, highly heritable phenotypes require smaller sample sizes to detect genetic associations)."

So, following from this, it makes sense that you are getting deflated estimates, since your factors are unit-variance scaled (i.e., treated as 100% heritable). It doesn't mean the model is wrong, just that the math for backing out N doesn't produce meaningful estimates under this scaling. You could instead try unit loading identification: fix the loadings to 1, and make sure you aren't passing std.lv=T as an argument, so that the factors aren't additionally scaled to have a (genetic) variance of 1. To be honest, this might end up throwing an error, so it would be worth estimating the model without the SNP effects first to see whether the specification is problematic before re-running the GWAS-by-subtraction. If this alternative specification also gives you issues, I can't think of a great way to back out N here, but perhaps others on the forum have ideas.

model<-'F1=~1*Pheno2+Pheno1
         F2=~1*Pheno1
         
         F1~SNP
         F2~SNP

         F1~~0*F2

         Pheno2 ~~ 0*Pheno1
         Pheno2~~0*Pheno2
         Pheno1~~0*Pheno1
         SNP~~SNP'
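
To put numbers on the scaling point: the expected sample size depends on (Z/beta)^2, and switching from unit loading to unit variance identification multiplies each beta (and its SE) by a constant while leaving Z fixed. A small arithmetic sketch, in Python for illustration, with all quantities made up:

```python
import math

def n_hat(z, beta, maf):
    """Expected sample size implied by a GWAS estimate:
    N_hat = (Z / beta)^2 / (2 * MAF * (1 - MAF))."""
    return (z / beta) ** 2 / (2 * maf * (1 - maf))

# Hypothetical SNP effect under unit LOADING identification,
# where the factor keeps the metric of the anchor phenotype.
z, beta_uli, maf = 5.0, 0.05, 0.25
h2 = 0.30  # assumed genetic variance of the factor on that scale

# Unit VARIANCE identification divides the factor by its SD,
# which multiplies the beta by 1/sqrt(h2); Z is unchanged.
beta_uvi = beta_uli / math.sqrt(h2)

print(n_hat(z, beta_uli, maf))  # N_hat on the ULI scale
print(n_hat(z, beta_uvi, maf))  # deflated by a factor of h2
```

Under these toy numbers, rescaling the factor to unit (100% heritable) variance shrinks N_hat by exactly the factor h2, which is why the unit-variance run can look dramatically underpowered even when the model is right.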

-Andrew

Hasan Alkhairo

Jul 13, 2023, 2:11:59 PM
to agro...@gmail.com, Genomic SEM Users
Hi there,

Thank you for bringing this up, Trine. This foreshadows an issue I will run into myself, since I plan on using my user-model GWAS for PRS modeling. Following your suggestions, Andrew, I tried unit loading identification with std.lv = FALSE, and the model failed to converge, as you predicted. Out of curiosity, I also tried std.lv = FALSE alone, without ULI, and the resulting fit was great. In this case, where the genetic variance is not constrained to 1 and we do not use ULI, how would we interpret N_hat?

Lastly, if we cannot find a better way to back out N, could we use a rough sample size estimate instead? For example, some scaled effective sample size of the indicators used to compute the latent factor.

Hasan




Hasan Alkhairo

Jul 13, 2023, 2:43:50 PM
to agro...@gmail.com, Genomic SEM Users
Hi there, 

Sorry for the back-to-back messages. I just want to point out that ULI with std.lv = FALSE did in fact give me a good fit: I had accidentally used NA* instead of 1* to fix the loadings. I'll report back my results for N_hat.

Trine Nielsen

Jul 14, 2023, 1:33:23 AM
to Genomic SEM Users
Hi Hasan and Andrew. Great to hear from both of you!

I tried running the model (without SNP effects) with ULI and std.lv=F, as you suggested, Andrew, and it estimated without error. I will now run the model with SNP effects and calculate Neff for the two latent factors.
If this new ULI model gives reasonable Neff values, should I use its GWAS sumstats in downstream analyses, or the sumstats from my original GWAS-by-subtraction?

Best wishes, Trine

agro...@gmail.com

Jul 14, 2023, 10:35:09 AM
to Genomic SEM Users
Hi all, 

Trine, a good check is that your p-values are approximately the same across the two GWAS specifications. The two forms of model identification will give you different estimates and standard errors because of the different scalings, but the ratio of estimate to standard error (and the corresponding p-value) should be identical. It is precisely because only the scaling of the estimate and standard error changes that I would hope the expected sample size comes out more within reason under the new specification, despite it producing equivalent p-values.
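
This invariance is easy to verify numerically: multiplying both the estimate and its standard error by any positive constant (which is all a change of identification does to a given SNP effect) leaves Z and the two-sided p-value unchanged. A quick Python sketch with made-up numbers:

```python
import math

def two_sided_p(z):
    # Two-sided normal p-value via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

beta, se = 0.04, 0.01       # hypothetical estimate and standard error
c = 0.55                    # arbitrary rescaling constant

z1 = beta / se              # Z under the original identification
z2 = (beta * c) / (se * c)  # Z after rescaling both beta and SE

print(z1, z2)                                          # identical
print(math.isclose(two_sided_p(z1), two_sided_p(z2)))  # True
```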

With respect to the point Hasan brought up about model fit: this is not a situation where model fit should be interpreted. The reason is that this is a fully saturated model, meaning you have 0 degrees of freedom, so model fit is guaranteed to be perfect (and Genomic SEM should print a message to the screen that model fit shouldn't be interpreted for the model without individual SNP effects).
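
To see why the no-SNP model is saturated, here is a back-of-the-envelope degrees-of-freedom count (my own tally for the two-indicator ULI specification above, not Genomic SEM output), sketched in Python:

```python
# Unique elements of the 2x2 genetic covariance matrix for
# Pheno1 and Pheno2: k*(k+1)/2 variances and covariances.
k = 2
moments = k * (k + 1) // 2  # 3

# Free parameters in the no-SNP ULI model: one free loading
# (F1 on Pheno1) plus two factor variances; the residual
# variances and the F1~~F2 covariance are all fixed.
free_params = 1 + 2  # 3

print(moments - free_params)  # 0 df: fit is perfect by construction
```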

Best, 
  Andrew

Elliot Tucker-Drob

Jul 14, 2023, 11:30:54 AM
to agro...@gmail.com, Genomic SEM Users
Dear Trine,

Please also be sure to use the equation for N_hat that we provide on the Genomic SEM wiki page that Andrew linked to. (Also note that we refer to the predicted sample size based on the GWAS SEs as N_hat, not Effective N, which we reserve for the expected sample size under an equally powered, balanced case-control design.) I have not looked closely at the equation provided on Demange's GitHub; it may be equivalent to the one we provide, or may even be adapted for use with unit-variance-identified latent variables.
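
For orientation, the per-SNP calculation in the thread's R snippet, minus the sqrt(loading) rescaling that is specific to Demange et al.'s Cholesky setup, translates to the sketch below (Python, with invented toy data; the real calculation runs over full GWAS output):

```python
# Toy per-SNP expected-N calculation mirroring the thread's R snippet;
# column names Z_Estimate, est, MAF become z, beta, maf.
snps = [
    {"z": 6.0, "beta": 0.030, "maf": 0.20},
    {"z": 4.5, "beta": 0.025, "maf": 0.35},
    {"z": 5.2, "beta": 0.028, "maf": 0.05},  # dropped: MAF below .10
]

def n_hat(z, beta, maf):
    # N_hat = (Z / beta)^2 / (2 * MAF * (1 - MAF))
    return (z / beta) ** 2 / (2 * maf * (1 - maf))

# Restrict to common SNPs (MAF .10-.40) and average, as in the R code.
kept = [s for s in snps if 0.10 <= s["maf"] <= 0.40]
mean_n_hat = sum(n_hat(s["z"], s["beta"], s["maf"]) for s in kept) / len(kept)
print(int(mean_n_hat))  # floored mean, the thread's reported N
```

Whether an extra sqrt(loading) term belongs in the denominator depends on the scaling of the factor, which is exactly the identification question discussed earlier in the thread.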

I should also say that we are relying on your evaluation that the computed N_hat is very low. What constitutes a plausible N_hat likely depends on the Ns of the contributing GWAS and the proportion of unexplained genetic variance in the downstream GWAS phenotype after controlling for the genetic component of the upstream GWAS phenotype. If the proportion is very low, we would likely expect the N_hat to be much smaller than the observed N.

It's hard to make a blanket recommendation for what to input into downstream analyses. That will likely depend on the software for the downstream analysis and what it uses. If the software only uses the Z and the N, then the betas input shouldn't matter at all as long as the beta to SE ratio is the same (which it should be), but the N input will matter. If the software uses betas that are expected to come from GWAS of standardized phenotypes, then you need to decide what scaling is most appropriate (e.g. unit loading vs. unit variance, and if using unit loading which GWAS phenotype to take on the metric of) and this decision is nontrivial. It may be that the N is used to compute the SE of the beta, so you need to choose the beta and N combination that would produce the correct Z statistic. Again, it's hard to provide blanket guidance, but you should be attentive to what the software is asking you to input and how it is using that information.

All the best,

Elliot



Kristian Ozol

Jul 18, 2023, 6:34:50 AM
to Genomic SEM Users
Hello Andrew and Elliot,

Thank you for your thorough responses; I really appreciate all your help in this matter. I am currently on vacation, but I will try out your suggestions when I get back.

Best wishes,
Trine