Advice on GWAS by subtraction

419 views
Skip to first unread message

Sam Kleeman

unread,
Jan 19, 2021, 11:54:52 AM1/19/21
to Genomic SEM Users
Hi,

Many thanks for making this amazing package available. I am trying to use two measured traits (in UK Biobank data) to generate three latent traits, with factor relationships as in the image below.

I am struggling to get the model (initially without SNP) to converge with this approach. I have copied the code I am using below.

Would be really grateful for any advice on this!! Thanks in advance.

Kind regards,

Sam Kleeman
PhD Student, Cold Spring Harbor Laboratory, NY

Model
latenttrait1 =~ NA*measuredtrait1+ NA*measuredtrait2
latenttrait2 =~ NA*measuredtrait1
latenttrait3 =~ NA*measuredtrait2

latenttrait1~~1*latenttrait1
latenttrait2~~1*latenttrait2
latenttrait3~~1*latenttrait3


latenttrait1 ~~ 0*latenttrait2 + 0*latenttrait3
latenttrait2 ~~ 0*latenttrait3

measuredtrait1~~0*measuredtrait1
measuredtrait2~~0*measuredtrait2
measuredtrait1~~0*measuredtrait2

Model layout

traits.png

Elliot Tucker-Drob

unread,
Jan 19, 2021, 12:09:00 PM1/19/21
to Sam Kleeman, Genomic SEM Users
Hi Sam,

A common factor isn't well identified with only two indicators. You'd need to impose a constraint (e.g. equal factor loadings). This and related issues are discussed nicely in Lohelin's Cholesky Note: https://doi.org/10.1007/BF02361160 . See, in particular, Fig 3 and associated discussion.

Additionally, you are attempting to decompose 2 SNP-trait effects into 3 SNP-trait effects, which you don't have the df for. For k GWAS traits, your model can only estimate k or fewer SNP effects per SNP, unless you are imposing some sort of additional constraints. So, if you have a SNP effect on the common factor, you can only estimate k-1 independent pathways from the SNPs to the individual indicators. See our discussion of the common and independent pathways model in the original Grotinger et al. paper in Nature Human Behaviour. Also see the de la Fuente et al. Nature Human Behaviour paper's supplement on ‘Interpreting the heterogeneity statistic.' Some of the original work on this sort of model was described a while ago but Muthen: https://doi.org/10.1007/BF02296397

All the best,

Elliot
--
Elliot M. Tucker-Drob, Ph.D.
Professor
Department of Psychology
Faculty Research Associate
Population Research Center
The University of Texas at Austin
108 E. Dean Keeton Stop A8000
Austin, TX 78712-0187
tucke...@utexas.edu
www.lifespanlab.com


--
You received this message because you are subscribed to the Google Groups "Genomic SEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genomic-sem-us...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/genomic-sem-users/82f2c7f4-89a8-42ee-8a8b-817f551ee56dn%40googlegroups.com.

Sam Kleeman

unread,
Jan 20, 2021, 8:32:53 PM1/20/21
to Elliot Tucker-Drob, Genomic SEM Users
Dear Elliot,

Many thanks for your detailed response, that is very helpful. We have reviewed the work you suggested and changed our plans accordingly (to estimate only two factors from the available traits). However, I wanted to ask about expected runtimes: I have been using the script below which should fit the model to the first 1000 SNPs, this has been running on a 48 core machine for more than 8 hours so far today with no output apart from 'This is lavaan 0.6-7'. I have copied the script below, which is based on the R tutorial for GWAS by subtraction.

Kind regards,

Sam

load(file="Sumstats.RData")
load(file="LDSCoutput_cy_cr.RData")
LDSCoutput$I[1] = 1
LDSCoutput$I[4] = 1

sub = p_sumstats[1:1000,]

# model with SNP

model<-'C=~NA*TRAIT1 + start(0.2)*TRAIT1
        NC=~NA*TRAIT1 +start(0.4)*TRAIT1 +start(0.4)*TRAIT2

        C~SNP
        NC~SNP

        C~~1*C
        NC~~1*NC
        
        C~~ 0*NC

        TRAIT1~~0*TRAIT1
        TRAIT2~~0*TRAIT2
        TRAIT1~~0*TRAI2
        SNP~~SNP'


#Run the Genomic SEM GWAS
outputGWAS<-userGWAS(covstruc=LDSCoutput,SNPs=sub,estimation="DWLS",model=model, cores=48, parallel=TRUE, GC='none')


Michel Nivard

unread,
Jan 21, 2021, 2:30:21 AM1/21/21
to Genomic SEM Users

Can you try a run on a single core? this is unexpected behaviour, sure gemomicSEM can take a while to run (it fits a model per SNP after all) but more than a second per SNP (or  may indicate issues. One issue is that parallel may create a copy of the entire workspace for each of 48cores, and if your workspace is alrge (i.e.c contains a lot of data) this would exceed RAM.

If t is the case that the GWAS runs fine on a single core you can split the data per chromosome, and stat 22 R processes next to each other (each with their own workspace containing only 1 chromosome of data), or run parallel with less cores, both options may be faster. If however the process still is slow on a single core come back tu us and we can take another look.

Best,
Michel

Sam Kleeman

unread,
Jan 21, 2021, 10:33:56 AM1/21/21
to Michel Nivard, Genomic SEM Users
Hi Michel,

Thanks for your email. Running with parallel=FALSE fixes the problem. Is there any reason why running with parallel=TRUE stops it from running?

Best wishes,

Sam

You received this message because you are subscribed to a topic in the Google Groups "Genomic SEM Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/genomic-sem-users/s2UDKbVzakg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to genomic-sem-us...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/genomic-sem-users/db75619f-63bf-41df-a3e8-5b6c9eedacb3n%40googlegroups.com.

Michel Nivard

unread,
Jan 21, 2021, 11:43:31 AM1/21/21
to Genomic SEM Users
Could be a few things what happens if you try like 2 cores? the issue could be RAM related so running 48 R instances, each with their own copy of the whole workspace may exceed the RAM you have available? I assume this is a linux system right?

Best,
Michel

Sam Kleeman

unread,
Jan 21, 2021, 11:56:08 AM1/21/21
to Michel Nivard, Genomic SEM Users
2 cores results in the same issue, we are running on Centos linux.

Sam

Sam Kleeman

unread,
Jan 26, 2021, 10:53:04 PM1/26/21
to Michel Nivard, Genomic SEM Users, Meyer, Hannah
Hi Michel,

I was hoping to ask about the origin of the 'reference.1000G.maf.0.005.txt’ file provided with the package? Is this from all 1000G samples or just European? We are doing a multi-ancestry GWAS and I was hoping to use a reference file with the MAF from each population. I am using population specific UK biobank data (from https://pan.ukbb.broadinstitute.org) but I am finding odd behavior for rsIDs with more than one SNP - for example, rs74640353, when run through the sumstats function I get a strange output where this SNP gets repeated multiple times with the SE/beta number interleaved, example copied below. From some exemplar searches in the 'reference.1000G.maf.0.005.txt’ file, it looks like this type of SNP has been removed? These SNPs are not duplicated in the primary data, examples copied below. The output for this SNP should just be two rows, one for each allelic combination (C/A and C/T).

Really grateful for your help with this!

Best wishes,

Sam

[skleeman@bamdev2 ~]$ grep 89265101 /grid/wsbs/home_norepl/skleeman/reference.1000G.maf.0.005.txt
[skleeman@bamdev2 ~]$ 

[skleeman@bamdev2 EUR]$ grep 89265101 GWAS_summary.tsv
16:89265101 rs74640353 C A 9.5953e-01 -7.2195e-01 6.9540e-02 9.8633e-01 431764 5.7000e-24
16:89265101 rs74640353 C T 6.5306e-01 -9.9393e-02 2.8678e-02 1.0000e+00 431764 1.1000e-04

[skleeman@bamdev2 EUR]$ grep 89265101 /mnt/grid/janowitz/home/references/maf/panukb_snps0.005_EUR.tsv
rs74640353 16 89265101 4.0150e-02 C A
rs74640353 16 89265101 3.4675e-01 C T


SNPCHRBPMAFA1A2beta.CPse.CPbeta.EAse.EA
<chr><chr><int><dbl><chr><chr><dbl><dbl><dbl><dbl>
7276317rs7464035316892651010.04015CA-0.0553485920.005481726-0.0276634770.005481726
7276318rs7464035316892651010.04015CA-0.0553485920.005481726-0.0071781790.002261069
7276319rs7464035316892651010.04015CA-0.0087444680.002261069-0.0276634770.005481726
7276320rs7464035316892651010.04015CA-0.0087444680.002261069-0.0071781790.002261069
7276321rs7464035316892651010.34675CT-0.0553485920.005481726-0.0276634770.005481726
7276322rs7464035316892651010.34675CT-0.0553485920.005481726-0.0071781790.002261069
7276323rs7464035316892651010.34675CT-0.0087444680.002261069-0.0276634770.005481726
7276324rs7464035316892651010.34675CT-0.0087444680.002261069-0.0071781790.002261069


On Jan 21, 2021, at 11:43, Michel Nivard <m.g.n...@gmail.com> wrote:

Elliot Tucker-Drob

unread,
Jan 27, 2021, 9:29:51 AM1/27/21
to Sam Kleeman, Michel Nivard, Genomic SEM Users, Meyer, Hannah
Hi Sam,
You should not use Genomic SEM for a multi-ancestry analysis without substantial adaptation of the software and pipeline. This is something that we will be working on, but it will require a good deal of methodological innovation and won't be ready in the near term. This is because the LDSC portion of the S matrix that is modelled as part of userGWAS must take a single set of ancestry-specific LD scores. Versions of LDSC that take multiple ld reference files are still in development. You can of course run the entire Genomic SEM pipeline as-is with summary statistics that are all from a single ancestry group that is not European; all you would need to do is supply appropriate ld scores and reference sample MAFs for the ancestry group that you are analyzing. If you would like to analyze data from multiple ancestry groups, you should analyze each group one at a time (This is put nicely at the beginning of this wiki page: https://github.com/bulik/ldsc/wiki/What-Data-Are-Necessary-to-Estimate-Genetic-Correlation%3F). The reference files that are we currently link to are for European ancestry, and the Pan-UK resource that you link to provides reference files for other ancestry groups.
Elliot
--
Elliot M. Tucker-Drob, Ph.D.
Professor
Department of Psychology
Faculty Research Associate
Population Research Center
The University of Texas at Austin
108 E. Dean Keeton Stop A8000
Austin, TX 78712-0187
tucke...@utexas.edu
www.lifespanlab.com

Sam Kleeman

unread,
Jan 27, 2021, 9:34:16 AM1/27/21
to Elliot Tucker-Drob, Michel Nivard, Genomic SEM Users, Meyer, Hannah
Dear Elliot,

Sorry I did not mean to suggest we are including multiple ancestry groups in the same summary statistics. We have generated separate statistics for three ancestry groups (EUR, CSA, AFR) and we have separate LD statistics for each population derived from the Pan UK Biobank project, and these are used for the LD score regression. As a result, we would like to use separate allele frequency reference files for each population so that we include SNPs with MAF >0.005 in each ancestry group.

I have looked again at the reference file and there are no duplicate rsIDs present, is that because these have been removed as part of generating this file? I am experiencing very odd behavior with duplicate rsIDs using the sumstats function.

Kind regards,

Sam

Michel Nivard

unread,
Jan 27, 2021, 2:30:27 PM1/27/21
to Sam Kleeman, Elliot Tucker-Drob, Genomic SEM Users, Meyer, Hannah
Hi Sam,

I recommend you make a specific ref file for this analysis, no need to restrict to 1000GSNPs as is the case when you use public sumstats that are poorly documented or don’t have maf info. Here you have that info so go ahead and make your own ref file if you can.

Re the issue with multiple SNPs on one position, you may have caught a bug, I’ll look into it ASAP.

Thanks for reporting!

Michel

Op wo 27 jan. 2021 om 15:34 schreef Sam Kleeman <samkl...@gmail.com>

Sam Kleeman

unread,
Jan 27, 2021, 2:42:42 PM1/27/21
to Michel Nivard, Elliot Tucker-Drob, Genomic SEM Users, Meyer, Hannah
Dear Michel,

Really grateful for your help with this.

I suspect the problem may relate to the following line in sumstats.R, I wonder if the merge should be by=c(“SNP”, “A1”,”A2”)?

data.frame.out <- suppressWarnings(inner_join(data.frame.out,output,by="SNP",all.x=F,all.y=F))

Kind regards,

Sam

Michel Nivard

unread,
Jan 27, 2021, 3:06:21 PM1/27/21
to Genomic SEM Users
Ok, thanks for the possible fix, I quickly discussed this, and  we see some issues, for one its hard to determine what the variance of a multi-alleic variant is. This in turn will cause issues down stream when defining the variance of a SNP in the covariance matrix.

Will require more thought on our end, perhaps we'll have to formally split the variant or skip them entirely.

Anyway, let me know if you have any further input/suggestions on this issue for us.

Best,
Michel
Reply all
Reply to author
Forward
0 new messages