Common Factor GWAS runtime

dehug...@gmail.com

Jul 28, 2020, 9:59:28 AM
to Genomic SEM Users
Hello. I am trying to run a common factor GWAS. I saw that the example on the GitHub page took about 8 hours on an 8-core laptop. I have tried running it through several options, including my institution's computing cluster using 4 cores; however, it ran longer than 192 hours and was aborted for exceeding the allotted time. I think there may be a bug, but I'm not sure how to verify that. Can you assist, please? I have pasted my code below. I successfully munged the data prior to running it, and the ldsc and sumstats functions completed without issue. However, commonfactorGWAS gets stuck (>192 hours).


traits.snp <- c('adhd.sumstats.gz','asd.sumstats.gz','mdd.sumstats.gz', 'ts.sumstats.gz')
sampleprev.snp <- c(.36,.40,.35,.34)
popprev.snp <- c(.05,.01,.15,.008)
trait.names.snp <- c('ADHD','ASD','MDD','TS')
ld <- "eur_w_ld_chr/"
wld <- "eur_w_ld_chr/"

ldsc.snpef <-  ldsc(traits.snp, sampleprev.snp, popprev.snp, ld, wld, trait.names.snp)      


files = c('PGC_Files/ADHD/pgc.adhd.reformat.txt', 'PGC_Files/ASD/pgc.asd.reformat.txt','PGC_Files/MDD/pgc.mdd.reformat.txt','PGC_Files/TS/pgc.ts.reformat.txt')
ref='PGC_Files/reference.1000G.maf.0.005.txt'
trait.names.p <- c('ADHD','ASD','MDD','TS')
se.logit=c(T,T,T,T)

p_sumstats <- sumstats(files=files, ref=ref, trait.names = trait.names.p, se.logit=se.logit)


pfactor <- commonfactorGWAS(covstruc = ldsc.snpef, SNPs = p_sumstats, parallel = TRUE)

Andrew Grotzinger

Jul 28, 2020, 1:34:20 PM
to Genomic SEM Users
This is a great question that I can hopefully provide some clarity on and help troubleshoot. To start, run time is going to depend on three main factors:
1. The number of SNPs left after running the sumstats function. Here, I wonder if you are left with quite a few more variants than the example on the GitHub.

2. How fast the model without individual SNPs runs (and, relatedly, how well it fits the data). In my experience the Tourette Syndrome summary statistics have tended to cause problems with model convergence, depending on the included traits. A good test here would be to run the commonfactor function using just the ldsc output and see how it behaves. It might take only half a second longer than the example on the GitHub, but that scales up quickly with respect to total run time when the model is being estimated across a million-plus SNPs.

3. The computing power for whatever you are using to run your analyses. For most computing clusters, my sense is a standard number of cores for a single run is 12+ (on our computing cluster it's 24 cores per run). With that said, I would recommend using more than the 4 cores initially used if possible. 
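
The quick test in point 2 can be run in a couple of lines. As a sketch, reusing the ldsc output object name from the original post (commonfactor is the Genomic SEM function that fits the one-factor model without SNP effects):

```r
# Fit the common factor model without SNP effects using the existing
# ldsc output; DWLS is Genomic SEM's default estimator.
library(GenomicSEM)
cf <- commonfactor(covstruc = ldsc.snpef, estimation = "DWLS")

# Inspect model fit (chisq, CFI, SRMR) and the factor loadings.
cf$modelfit
cf$results
```

If this single fit already takes more than a second or so, that largely explains a long total run time once it is repeated per SNP.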

To summarize, I would recommend taking a look at how well a common factor model fits the data, determining how many SNPs are being analyzed from the number of rows in the sumstats output, and increasing the number of cores if possible. If you end up with a model that fits the data well but have a lot of SNPs, or the model takes a little longer to estimate, I would additionally recommend splitting the data across different jobs. I know others have written scripts that automatically split the sumstats output into 20 or so chunks and send off 20 separate jobs. Since the model estimated for each SNP is entirely independent of the others, how the data is split is completely up to you. As a final troubleshooting option, you might try running a subset of 5,000 SNPs just to make sure it isn't a more general bug. Let me know if you have any other questions!
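
The splitting step itself could look something like this (a sketch using the object name from the original post; the chunk file names are placeholders, and each saved chunk would then be passed as the SNPs argument of commonfactorGWAS in its own cluster job):

```r
# p_sumstats is the output of sumstats() from the original post.
# Split it into 20 roughly equal chunks; since each SNP's model is
# estimated independently, any split is valid.
n.chunks <- 20
chunk.id <- cut(seq_len(nrow(p_sumstats)), breaks = n.chunks, labels = FALSE)

for (i in seq_len(n.chunks)) {
  saveRDS(p_sumstats[chunk.id == i, ],
          file = paste0("sumstats_chunk_", i, ".rds"))
}
```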

Best, 
  Andrew

Michel Nivard

Jul 28, 2020, 4:34:33 PM
to Genomic SEM Users
Let me add some Q's that may help us focus in on the problem further.

Can you share with us the runtime of the model without a SNP in it, and perhaps the runtime of the model with a selection of 100 SNPs?

And could you also share the model output for the 1-factor model without a SNP in it, so we can help you judge whether the model fits the data poorly enough to have trouble converging (i.e., converges very slowly)?

If you run a couple of hundred SNPs, do you get any particular warnings passed on by lavaan?

Best,
Michel

dehug...@gmail.com

Jul 28, 2020, 4:57:11 PM
to Genomic SEM Users
Great, thanks so much. It appears I was just a bit impatient and using too few cores!

dehug...@gmail.com

Jul 28, 2020, 5:06:22 PM
to Genomic SEM Users
Didn't see this until after I posted the last message, apologies. Runtime of the model without a SNP was a few seconds. The output is below. With 500 SNPs, the runtime on 16 cores was about 15-30 secs, which struck me as a surprising amount of variance (but I could be wrong). I split up the SNPs and am running them on separate processors totaling 64 cores, so hopefully it should take about a day. I'm curious to know your opinion of the model fit based on the output below, and, if you wouldn't mind, how you arrived at your conclusion? Just for the sake of learning. Thanks, Dylan

$modelfit
      chisq df   p_chisq      AIC CFI       SRMR
df 1.197306  2 0.5495514 17.19731   1 0.01851209

$results
    lhs op  rhs Unstandardized_Estimate Unstandardized_SE Standardized_Est Standardized_SE      p_value
1    F1 =~ ADHD              0.33507432        0.03083809        0.7178579      0.06606705 1.681190e-27
2    F1 =~  ASD              0.16960056        0.01643711        0.5071835      0.04915450 5.833675e-25
3    F1 =~  MDD              0.22165838        0.02316858        0.7922188      0.08280573 1.098421e-21
4    F1 =~   TS              0.11431648        0.02527523        0.2483013      0.05489912 6.100786e-06
20 ADHD ~~ ADHD              0.10559914        0.02153708        0.4846800      0.09885113 9.432134e-07
21  ASD ~~  ASD              0.08305690        0.01037405        0.7427649      0.09277347 1.182870e-15
22  MDD ~~  MDD              0.02915247        0.01037271        0.3723894      0.13249945 4.946496e-03
23   TS ~~   TS              0.19889445        0.02438859        0.9383465      0.11506074 3.485276e-16

Michel Nivard

Jul 29, 2020, 10:51:28 AM
to Genomic SEM Users
So the model fits the data well. Briefly: the chisq test, which tests the model against a freely estimated covariance matrix, isn't significant, which means the model doesn't describe the covariance significantly worse. CFI is 1; CFI ranges from 0 to 1 and is a comparison against an "empty" model where the traits are uncorrelated, so 1 means good fit compared to an empty model. Then the SRMR quantifies the (weighted, I believe) squared difference between the observed and expected covariance matrices; it's lowish here, which is good.

This all being said, the model fits the covariance matrix well, but the factor doesn't explain a lot of the heritable variation in, for example, TS (.114^2 / (.114^2 + .199) = ~6%).
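
Spelling that calculation out: the variance in TS explained by the factor is the squared unstandardized loading, and the total heritable variance is that plus the residual variance, both taken from the $results table in the earlier post:

```r
loading  <- 0.11431648   # F1 =~ TS, Unstandardized_Estimate
residual <- 0.19889445   # TS ~~ TS, Unstandardized_Estimate (residual variance)

prop_explained <- loading^2 / (loading^2 + residual)
round(prop_explained, 3)  # about 0.062, i.e. ~6%
```

This agrees with the standardized loading from the same table: 0.248^2 is also about 0.062.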

So if you added TS because you are interested in genetic effects on TS, be forewarned that this factor isn't really a reflection of TS (because TS doesn't correlate too highly with the other traits, I think).

Your model is slower than usual I'd say, but not entirely unheard of. To speed things up you can specify starting values near (but not at!) the final estimates, and perhaps things will go a bit quicker.
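
A sketch of supplying starting values through the usermodel function; the start() syntax is standard lavaan, and the values below are assumptions chosen near, but not at, the estimates reported above:

```r
# One-factor model with starting values close to (but not exactly at)
# the earlier unstandardized loadings, to help convergence speed.
model <- 'F1 =~ start(0.3)*ADHD + start(0.15)*ASD +
                start(0.2)*MDD + start(0.1)*TS'

fit <- usermodel(covstruc = ldsc.snpef, model = model, estimation = "DWLS")
```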

Best,
Michel

Andrew Grotzinger

Jul 29, 2020, 6:28:22 PM
to Genomic SEM Users
I agree with Michel; I think the model fits relatively well, but it seems to be running a little slow, potentially because TS is sitting off by itself in terms of the factor loadings (and the factor should be interpreted in this light, as Michel also notes). I think the common factor model on the GitHub runs in about .3 seconds, so if your model runs for a couple of seconds, that will scale up run time pretty quickly. That said, it sounds like you were able to split up the run and use a larger number of cores; hopefully that does the trick, but if you are still having run time issues, let us know!

dehug...@gmail.com

Jul 31, 2020, 1:35:31 PM
to Genomic SEM Users
Thanks, both! Very informative. I also realized a lot of this is in the paper, which is very helpful. I had another question though: what does it mean if the chisq and AIC are 'NA'? I also receive the error "system is computationally singular", which I believe has something to do with a poor model(?), when I try to generate the GWAS for the common factor. Are the two (chisq=NA and the error) related?

Thanks again,
Dylan


Andrew Grotzinger

Aug 3, 2020, 9:04:49 AM
to Genomic SEM Users
Hi Dylan,

I think those two issues are unrelated. When the chisq and AIC are NA, that should also come with a message saying they print as NA because you are estimating a fully saturated model (i.e., degrees of freedom = 0), as will be the case when you estimate a common factor with three indicators. It's not something to be concerned about; you just can't compute model fit estimates in this case because the model will exactly fit the data.
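
To see why three indicators give a saturated model: a k-variable genetic covariance matrix has k(k+1)/2 unique elements, while a one-factor model (with the factor variance fixed to 1 for identification) estimates k loadings plus k residual variances:

```r
k <- 3
observed  <- k * (k + 1) / 2  # unique (co)variance elements: 6
estimated <- 2 * k            # 3 loadings + 3 residual variances: 6
observed - estimated          # df = 0: the model exactly reproduces the data
```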

The system is computationally singular error in the context of the common factor gwas will sometimes print with a reciprocal condition number. Is that number included in the error message? 

-Andrew

pxwa...@gmail.com

Aug 3, 2020, 4:34:10 PM
to Genomic SEM Users
Hi Michel and Andrew,

I tried the common factor model without SNP effects. It had problems with convergence (see below):
> commonfactor_DWLS <- commonfactor(covstruc = LDSCoutput, estimation="DWLS")
[1] "The common factor initially failed to converge. A lower bound of 0 on residual variances has been added to try and troubleshoot this."
[1] "The common factor model failed to converge on a solution. Please try specifying an alternative model using the usermodel function."
Error in lav_fit_measures(object = object, fit.measures = fit.measures, :
lavaan ERROR: fit measures not available if model did not converge


I also ran the first 100 SNPs in the common factor GWAS on a cluster with 38 CPUs, and it produced the 38 warnings below. I think these warnings correspond to the 38 cores requested and were caused by the failure to converge as indicated above.
> SNPs <- commonfactor_sumstats[1:100, ]
> cores <- 38
> covstruc <- LDSCoutput
> snp100_commonfactor_gwas <- commonfactorGWAS(covstruc = covstruc, SNPs = SNPs, cores=cores)
[1] "Please note that an update was made to commonfactorGWAS on 11/21/19 so that it combines addSNPs and commonfactorGWAS."
This is lavaan 0.6-5
lavaan is BETA software! Please report any bugs.
elapsed
37.7
There were 38 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In for (i in 1:length(args)) { ... :
closing unused connection 40 (<-localhost:11471)
2: In for (i in 1:length(args)) { ... :
closing unused connection 39 (<-localhost:11471)
[... warnings 3 through 38 are identical in form, closing unused connections 38 down to 3 ...]

I'll try usermodel and post the results.

Many thanks,
patrick

Rajini Nagrani

Oct 26, 2022, 5:42:01 AM
to Genomic SEM Users
Hello,

If I may jump in here, as I have a similar query: I am running commonfactorGWAS and calculated the heritable variation (as shown above) for the component traits, and one of the traits (fg_new) shows heritable variation of only 1%. Please see the attachment with the heritable variation, genetic correlation, and genetic covariance matrices.

I have the following questions:

1. What could be the disadvantages of going ahead with this model?
2. fg_new doesn't correlate with the other traits, but it shows a moderate correlation with homa_IR (~0.32). Is this a good enough reason to keep the fg_new sumstats?
3. I am also using the factanal function to fit a 2-correlated-factor model. What would the formula for heritable variation be in that case?

Sorry if these questions are too basic for the group, but I couldn't find an answer elsewhere.

Looking forward to your suggestions

Best regards,
Rajini




Factors explaining heritable variation.xlsx