Binary vs continuous traits


Tabea

Mar 11, 2021, 6:27:31 AM
to Genomic SEM Users

Dear all,
I just wanted to quickly double-check that the se.logit/linprob arguments I use when running sumstats() are correct, given the mix of binary and continuous traits I am including.

So I have
- 2 continuous traits ("continous1", "continous2")
- 2 binary traits where the betas/SEs are reported on the logistic scale ("binary1", "binary2")
- 1 UK Biobank binary trait that was analysed using the Hail software ("binaryHail"), with the proportion of cases being prop = 0.5

In sumstats(), I included the following:
traits=c("continous1", "continous2", "binary1", "binary2", "binaryHail")
se.logit=c(F, F, T, T, F)
linprob=c(F, F, F, F, T)
prob=c(NA, NA, NA, NA, 0.5)

Does that sound about right?

As usual, many thanks for your help and the great work you are doing

Best wishes,
Tabea

agro...@gmail.com

Mar 11, 2021, 9:26:09 AM
to Genomic SEM Users

Hi Tabea, 

All of the arguments you specified are correct, but you also need to specify the OLS argument for your two continuous traits, so you would additionally write:
OLS=c(T,T,F,F,F)

I've recently updated the .log file that the sumstats function creates to be clearer about how it interprets column names and performs conversions, so that can be a helpful thing to check in general if you are uncertain about how you specified the arguments. Since I know sumstats is a particularly confusing function, I also added a flowchart to one of the general information pages on the Genomic SEM GitHub in case that's also helpful (https://github.com/GenomicSEM/GenomicSEM/wiki/2.-Important-resources-and-key-information). Let me know if you have any other questions!
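[For anyone assembling this from scratch, the full call might look something like the sketch below. The file names, reference path, and sample sizes are placeholders, and the argument names should be double-checked against the sumstats() documentation on the wiki for your version of the package:]

```r
# Hypothetical sketch (file names and reference path are placeholders):
require(GenomicSEM)

cleaned <- sumstats(
  files       = c("continous1.txt", "continous2.txt", "binary1.txt",
                  "binary2.txt", "binaryHail.txt"),
  ref         = "reference.1000G.maf.0.005.txt",
  trait.names = c("continous1", "continous2", "binary1", "binary2", "binaryHail"),
  OLS         = c(T, T, F, F, F),       # continuous traits analyzed with OLS
  se.logit    = c(F, F, T, T, F),       # binary traits with SEs on the logistic scale
  linprob     = c(F, F, F, F, T),       # binary trait analyzed with a linear model (Hail)
  prop        = c(NA, NA, NA, NA, 0.5)  # case proportion for the linprob trait
)
```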

Best, 
  Andrew

Tabea

Mar 11, 2021, 10:21:25 AM
to Genomic SEM Users
Hi Andrew,

Great, many thanks for double checking, I will definitely add the OLS argument. The flowchart is super useful too!

BW,
Tabea

Tabea Schoeler

Apr 16, 2021, 3:28:40 AM
to Genomic SEM Users
Hi Andrew,

Just a quick follow-up question - I noticed that some of my included summary statistics for continuous phenotypes are already standardized with respect to the variance of the phenotype. So I was wondering whether I need to adjust the sumstats() arguments for those particular phenotypes, as they don't need any further transformation before running the GWA?

Best wishes,
Tabea

agro...@gmail.com

Apr 16, 2021, 8:32:44 AM
to Genomic SEM Users
Hi Tabea, 

Good question! You'll still want to run it through sumstats with the OLS argument set to TRUE. This just ensures everything is on the same scale even when a phenotype has been standardized, and since sumstats is additionally aligning all traits to a single reference allele, it's still necessary to run this pre-processing step.
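[As a toy illustration of what allele alignment means here - this is not the package's internal code, just the general idea - the sign of a beta is flipped whenever a trait's effect/other alleles are swapped relative to the shared reference:]

```r
# Toy illustration of allele alignment (not GenomicSEM internals):
# flip the sign of beta when a trait's effect/other alleles are
# swapped relative to the shared reference; NA flags a mismatch.
align_beta <- function(ref_a1, ref_a2, a1, a2, beta) {
  same    <- a1 == ref_a1 & a2 == ref_a2   # already aligned: keep beta
  flipped <- a1 == ref_a2 & a2 == ref_a1   # swapped alleles: flip the sign
  ifelse(same, beta, ifelse(flipped, -beta, NA))
}

aligned <- align_beta("A", "G",
                      a1   = c("A", "G", "C"),
                      a2   = c("G", "A", "T"),
                      beta = c(0.10, 0.20, 0.30))
# first SNP kept as-is, second flipped to -0.20, third (allele mismatch) set to NA
```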

Best, 
  Andrew

Tabea

Apr 16, 2021, 9:29:56 AM
to Genomic SEM Users
Ok thanks, good to know, that's what I had done, but I wasn't sure whether prior standardization was causing some weird estimates I keep getting when running the common factor GWAS. For example, when obtaining LDSC heritability and intercept estimates of a common factor composed of highly correlated and well-powered continuous UKBB traits (weight, height, hip & waist circumference), I get an effective sample size that is way too low (Neff = 73593), and the LDSC estimates are also off:
h2 = 1.03
intercept: 0.60
lambda GC: 1.71

Any idea what might be causing this?

BW,
Tabea

agro...@gmail.com

Apr 16, 2021, 10:15:37 AM
to Genomic SEM Users
That's informative that the intercept is super deflated. A couple of follow-up questions to help figure this out:
1. How were the continuous UKBB traits analyzed in the original GWAS (e.g., BOLT-LMM)?
2. Are you estimating the GWAS using the commonfactorGWAS function or the userGWAS function?
3. Are there any errors or warnings in the GWAS output? Sometimes you might get numbers like that if, for example, lavaan prints warnings about negative observed- or latent-variable variances.

Tabea

Apr 16, 2021, 10:42:14 AM
to Genomic SEM Users
Sure!

Regarding 1), I used the Pan-UK Biobank sumstats and only included inverse-rank normal transformed phenotypes, which were analyzed using SAIGE (cf. info here: https://pan.ukbb.broadinstitute.org/docs/technical-overview/index.html ).

2) I use userGWAS() as below:

model <- 'F1 =~ NA*BMI +  weight + hip_circumference + waist_circumference
F1 ~~ 1*F1
BMI ~~ a*BMI
weight ~~ b*weight
hip_circumference ~~ c*hip_circumference
waist_circumference ~~ d*waist_circumference
a > .0001
b > .0001
c > .0001
d > .0001
F1 ~ SNP'

results = userGWAS(covstruc=LDSCoutput,
SNPs=SNPdata,
estimation = "DWLS",
model = model,
printwarn = TRUE,
toler = 1e-500,
parallel = FALSE,
SNPSE = 0.05,
sub=c("F1~SNP"),
GC="none")

3) There are no errors or warnings

Elliot Tucker-Drob

Apr 16, 2021, 11:06:42 AM
to Tabea, Genomic SEM Users
Hi Tabea.

A few thoughts:
1. In particular note: "In order for ld-score regression to produce accurate results it is critical that the user both include only summary statistics that were calculated within a single ethnic group AND that these summary statistics are matched with LD scores for the same ethnicity."

2. We recommend unit loading identification for factors-as-outcomes models (i.e., models with SNPs predicting factors). In our experience this tends to avoid issues surrounding convergence, local fit minima, etc.

Additionally, the effective N estimate is linked with the SNP heritability of the phenotype. The intuition behind this is rather straightforward: you need larger sample sizes to detect genetic effects on variables that have lower overall heritability. When you use unit variance identification, your SNP betas are essentially being scaled such that the heritability of the outcome is 100% (i.e., the betas are larger), so the effective N is smaller (you don't need as large a sample to detect larger betas). However, we never encounter phenotypes with SNP h2 anywhere close to 1.0, so the effective N will not be on an intuitive scale. Rather, it would be better for your factor to take on the scale of the SNP heritability of one of its indicators, so that the effective N is more sensible and intuitive. You can achieve this with unit loading identification, keeping in mind that the effective N will be linked to the portion of SNP h2 in the reference indicator you have chosen that is explained by the common factor.
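[This scaling intuition can be illustrated numerically. Under the standard approximation for a standardized continuous outcome, SE(beta) ≈ 1/sqrt(2*MAF*(1-MAF)*N), so an implied sample size can be backed out from Z, beta, and MAF; rescaling the betas (as a change of factor identification effectively does) rescales the implied effective N by the square of that factor. A toy calculation with made-up numbers:]

```r
# Toy illustration of why effective N depends on how the factor is scaled.
# For a standardized outcome, SE(beta) ~ 1/sqrt(2*MAF*(1-MAF)*N), so an
# implied N can be backed out as N_hat = (Z/beta)^2 / (2*MAF*(1-MAF)).
maf  <- 0.3
N    <- 100000
beta <- 0.02
se   <- 1 / sqrt(2 * maf * (1 - maf) * N)  # approximate SE at this N
z    <- beta / se                          # Z-statistic

n_hat <- (z / beta)^2 / (2 * maf * (1 - maf))  # recovers N = 100000

# If an identification choice inflates beta (and its SE) by a factor k
# while leaving Z unchanged, the implied effective N shrinks by k^2:
k <- 2
n_hat_scaled <- (z / (k * beta))^2 / (2 * maf * (1 - maf))  # N / 4 = 25000
```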

Elliot 



Tabea

Apr 26, 2021, 11:45:46 AM
to Genomic SEM Users
Hi Elliot,

Many thanks for the detailed response, it's now much clearer to me why I got these low estimates for the effective N. Indeed, using unit loading identification instead of unit variance identification gave me an Neff of the common factor that is very close to the Ns of the indicators I am using.

Unfortunately there is still a problem with the intercept of the common factor, which I think is because I misspecify something when using the userGWAS() function. When I run a GWA using commonfactorGWAS() (default parameters), I get an intercept estimate of the common factor that is close to 1, but with userGWAS() it drops to 0.6.

Is there anything I need to change in the code below in order to reproduce the results from commonfactorGWAS()? 

Best wishes,
Tabea

model <- 'F1 =~ 1*BMI +  weight + hip_circumference + waist_circumference
      BMI ~~ a*BMI
      weight ~~ b*weight
      hip_circumference ~~ c*hip_circumference
      waist_circumference ~~ d*waist_circumference
      a > .0001
      b > .0001
      c > .0001
      d > .0001
      BMI ~ 0*SNP
      weight ~ 0*SNP
      hip_circumference ~ 0*SNP
      waist_circumference ~ 0*SNP
      F1 ~ SNP'

results = userGWAS(covstruc=LDSCoutput, 
                   SNPs=SNPdata, 
                   estimation = "DWLS",
                   model = model, 
                   printwarn = TRUE, 
                   toler = FALSE, 
                   parallel = FALSE, 
                   SNPSE = 0.0005,  
                   sub=c("F1~SNP"),
                   GC="none")

Tabea

Apr 27, 2021, 5:23:13 AM
to Genomic SEM Users
One thing I forgot to mention is that I ran one GWA using the Neale UKBB sumstats and the other using the Pan UKBB sumstats (European ancestry only). The intercept seems to be fine when using commonfactorGWAS() on the Neale data, but somewhat off on the Pan data. I haven't used userGWAS() on the Neale data yet, so I will do that first to check whether it resolves the issue I have with the intercept.

Elliot Tucker-Drob

Apr 27, 2021, 9:49:38 AM
to Tabea, Genomic SEM Users
If you get different results for commonfactorGWAS vs. the same model specified as a userGWAS, there is indeed likely to be a coding error. I see you add positivity constraints on the residuals. If those are the culprits (check by releasing them to see if that changes things), that may suggest that the constraints are inappropriate, or that the model itself may not be appropriate for the data. I suggest first checking that the model produces sensible estimates in the no-SNP version, and that there isn't a smoothing warning indicating a major departure from a positive definite matrix.
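[A sketch of that no-SNP check, assuming the usermodel() function and argument names described on the Genomic SEM wiki (double-check them against your package version):]

```r
# Hypothetical sketch: fit the same factor model without the SNP term
# to confirm the no-SNP estimates look sensible before running userGWAS.
nosnp_model <- 'F1 =~ 1*BMI + weight + hip_circumference + waist_circumference'

nosnp_fit <- usermodel(covstruc   = LDSCoutput,  # output of ldsc()
                       estimation = "DWLS",
                       model      = nosnp_model)
nosnp_fit$results  # inspect loadings and residual variances for negatives
```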

I also note that Pan UKBB uses a mixed model, which can be an issue for the interpretation of LDSC estimates, especially the heritability and LDSC intercept. This paper suggests focusing on the attenuation ratio rather than the LDSC intercept.
Using the Neale lab sumstats is sensible given this, but this won't be a reason for any discrepancy between commonfactorGWAS vs. userGWAS.

