Re: [GenomicSEM/GenomicSEM] Question about implied sample size (Issue #100)

Elliot Tucker-Drob


Feb 7, 2025, 3:47:17 PM

to GenomicSEM/GenomicSEM, Genomic SEM Users, GenomicSEM/GenomicSEM, Subscribed

Please see https://github.com/GenomicSEM/GenomicSEM/wiki/5.-Multivariate-GWAS

Calculating Sample Size for Factors

Once the summary statistics for a common factor are produced, the user may wish to input those summary statistics into an outside program that requires a sample size (e.g., LDSC, LDpred). A description of one method for calculating effective sample size can be found in the online supplement of the bioRxiv preprint from Mallard et al. (2019): https://www.biorxiv.org/content/10.1101/603134v1.abstract. The equation described in the supplement can be written as below to produce the effective sample size for Factor 1. Note that the recommendation is also made to restrict the summary statistics to SNPs with MAF between approximately 10% and 40% in order to produce more stable estimates.

##Calculate Implied Sample Size for Factor 1
#restrict to MAF of 40% and 10%
#CorrelatedFactors is a list of data frames; [[1]] selects the output for Factor 1
CorrelatedFactors1<-subset(CorrelatedFactors[[1]], CorrelatedFactors[[1]]$MAF <= .4 & CorrelatedFactors[[1]]$MAF >= .1)

N_hat_F1<-mean(1/((2*CorrelatedFactors1$MAF*(1-CorrelatedFactors1$MAF))*CorrelatedFactors1$SE^2))

Note that when the phenotype is a latent factor, the choice of scaling the factor will have a nontrivial effect on the estimate of N_hat. Here we scale the latent genetic factors with unit loading identification, such that N_hat can be intuitively interpreted as the expected sample size for the factor scaled in the heritability units of the standardized reference phenotype (i.e. the phenotype with the unstandardized loading fixed to 1.0, in this case MDD). If we were to scale the latent genetic factors with unit variance identification, N_hat would be interpreted relative to a factor that is 100% heritable, and N_hat would be unintuitively very small (because, all else being equal, highly heritable phenotypes require smaller sample sizes to detect genetic associations).
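The effect of scaling described above can be sketched with simulated numbers. This is only an illustration under stated assumptions: the MAF values, the baseline N of 50,000, and the factor variance `psi` are hypothetical, and the sketch assumes that rescaling a factor by a constant rescales its SNP effects and standard errors by that same constant.

```r
# Hypothetical illustration with simulated values, not real GWAS output.
set.seed(1)
n_snps <- 1000
maf <- runif(n_snps, min = 0.1, max = 0.4)

# Implied sample size, following the formula quoted above
n_hat <- function(maf, se) mean(1 / (2 * maf * (1 - maf) * se^2))

# Assumed implied N when the factor is scaled to unit variance (hypothetical)
baseline_n <- 50000
se_unit_variance <- 1 / sqrt(2 * maf * (1 - maf) * baseline_n)

# Assumed factor variance under unit loading identification (hypothetical)
psi <- 0.05

# Moving from a factor SD of 1 to a factor SD of sqrt(psi)
# multiplies betas and SEs by sqrt(psi)
se_unit_loading <- se_unit_variance * sqrt(psi)

n_hat_uv <- n_hat(maf, se_unit_variance)
n_hat_ul <- n_hat(maf, se_unit_loading)

# The ratio of the two implied sample sizes equals 1 / psi
n_hat_ul / n_hat_uv
```

Under these assumptions the same summary statistics imply a sample size that differs by a factor of 1/psi depending only on how the factor is scaled.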

On Fri, Feb 7, 2025 at 2:43 PM zhong156 <notifi...@github.com> wrote:

Thank you so much for developing this wonderful tool!
I ran a multivariate GWAS for four factors, and when I calculated the implied sample size for one of the resulting factor GWAS using the provided formula, I got an extremely large value of 15 million. I was wondering if this means there is some inflation in the estimation and if the implied sample size calculation could be adjusted to provide more accurate results.
Thank you so much!

—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.

Elliot Tucker-Drob


Feb 7, 2025, 4:00:48 PM

to GenomicSEM/GenomicSEM, Genomic SEM Users, GenomicSEM/GenomicSEM, Comment

It's not just a matter of using the formula. Please see the portion that I pasted about how the scaling of the factor affects interpretation of the N_hat estimate. Just as you can get a very small N_hat when you use unit variance identification, you can get a very large N_hat when you use unit loading identification relative to a reference indicator that has low h2 and/or has a very low loading on the factor.
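To see how a low-h2 or weakly loading reference indicator inflates N_hat, here is a rough sketch. It assumes a single common factor and the usual SEM algebra, under which the factor variance with unit loading identification equals the reference indicator's h2 times its squared standardized loading. All values in the grid are hypothetical.

```r
# Hypothetical values chosen for illustration only
grid <- expand.grid(h2_ref = c(0.02, 0.10, 0.30),
                    std_loading = c(0.3, 0.6, 0.9))

# Factor variance under unit loading identification (assumed single-factor algebra):
# reference-indicator h2 times its squared standardized loading
grid$factor_var <- grid$h2_ref * grid$std_loading^2

# N_hat under unit loading identification, expressed as a multiple of
# N_hat under unit variance identification
grid$n_hat_multiplier <- 1 / grid$factor_var

# Largest multipliers first
grid[order(-grid$n_hat_multiplier), ]
```

In this sketch, the combination of the lowest h2 and the lowest loading produces a multiplier in the hundreds, while a well-measured, strongly loading reference indicator produces a multiplier in the single digits.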

On Fri, Feb 7, 2025 at 2:55 PM zhong156 <notifi...@github.com> wrote:

Thank you so much for your reply!
I did use the formula that you provided above to calculate the implied sample size. I checked the SE of my factor GWAS result and it was small, with a mean of about 0.0006. I was wondering if this was affecting the estimate to give me a very large implied sample size.
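As a rough arithmetic check, the formula can be evaluated at the reported SE for a few MAF values. The MAF values are assumptions (picked from the recommended 10% to 40% range), and a mean SE does not determine the mean of the per-SNP values, so this only indicates the order of magnitude.

```r
# Implied N for a single SNP at a given MAF and SE (same formula as above)
n_hat_snp <- function(maf, se) 1 / (2 * maf * (1 - maf) * se^2)

se <- 0.0006                # mean SE reported in the message above
maf <- c(0.10, 0.25, 0.40)  # assumed MAF values within the recommended range

# Per-SNP implied N at each assumed MAF
round(n_hat_snp(maf, se))
```

With an SE of this size the formula returns values in the millions at every MAF in the range, so an estimate of 15 million is what the formula gives for standard errors of that magnitude.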


Elliot Tucker-Drob


Feb 7, 2025, 4:45:30 PM

to GenomicSEM/GenomicSEM, Genomic SEM Users, GenomicSEM/GenomicSEM, Comment

You're asking me to make backwards inference about what model you fit and what your parameter estimates are. You should inspect your unconditional (no-SNP) model to answer that question for yourself.

The N that you input to LDSC is highly consequential for h2 and cov_g estimates, but will not affect rg estimates. You should exercise care with respect to that decision.
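A simplified sketch of why N matters for h2 and cov_g but cancels out of rg. It uses the basic LD score regression relationships (the chi-square slope is proportional to N times h2, and the cross-trait slope is proportional to sqrt(N1*N2) times cov_g) and ignores intercepts and regression weights. It does not call the LDSC software; the slopes, the number of SNPs `M`, and the sample sizes are all hypothetical.

```r
# Simplified LDSC-style estimators; intercepts and regression weights are ignored.
# All slopes and sample sizes below are hypothetical.
M <- 1e6  # assumed number of SNPs

h2_from_slope  <- function(slope, n) slope * M / n
cov_from_slope <- function(slope, n1, n2) slope * M / sqrt(n1 * n2)

ldsc_summary <- function(n_factor, n_trait = 100000,
                         slope_factor = 0.02, slope_trait = 0.01,
                         slope_cross = 0.007) {
  h2_f  <- h2_from_slope(slope_factor, n_factor)
  h2_t  <- h2_from_slope(slope_trait, n_trait)
  cov_g <- cov_from_slope(slope_cross, n_factor, n_trait)
  c(h2_factor = h2_f, cov_g = cov_g, rg = cov_g / sqrt(h2_f * h2_t))
}

# Same summary statistics (same slopes), two different N inputs for the factor
res <- rbind(N_200k = ldsc_summary(2e5), N_15M = ldsc_summary(1.5e7))
res
```

In this sketch, changing the N supplied for the factor changes h2 by the ratio of the two N values and cov_g by the square root of that ratio, while rg is identical in both rows.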

On Fri, Feb 7, 2025 at 3:30 PM zhong156 <notifi...@github.com> wrote:

Thank you so much for your reply!

Does this mean that, in the multivariate GWAS analysis, a trait with low h2 was used as the reference indicator for unit loading identification of this factor, which is why I get a very large sample size estimate? I was wondering whether I could use this large sample size in LDSC, or whether I should use a different method of estimating the sample size.
Thank you so much!