Advice on GWAS by subtraction

Sam Kleeman

unread,

Jan 19, 2021, 11:54:52 AM1/19/21

to Genomic SEM Users

Hi,

Many thanks for making this amazing package available. I am trying to use two measured traits (in UK Biobank data) to generate three latent traits, with factor relationships as in the image below.

I am struggling to get the model (initially without SNP) to converge with this approach. I have copied the code I am using below.

Would be really grateful for any advice on this!! Thanks in advance.

Kind regards,

Sam Kleeman

PhD Student, Cold Spring Harbor Laboratory, NY

Model

latenttrait1 =~ NA*measuredtrait1+ NA*measuredtrait2

latenttrait2 =~ NA*measuredtrait1

latenttrait3 =~ NA*measuredtrait2

latenttrait1~~1*latenttrait1

latenttrait2~~1*latenttrait2

latenttrait3~~1*latenttrait3

latenttrait1 ~~ 0*latenttrait2 + 0*latenttrait3

latenttrait2 ~~ 0*latenttrait3

measuredtrait1~~0*measuredtrait1

measuredtrait2~~0*measuredtrait2

measuredtrait1~~0*measuredtrait2

Model layout

Elliot Tucker-Drob

unread,

Jan 19, 2021, 12:09:00 PM1/19/21

to Sam Kleeman, Genomic SEM Users

Hi Sam,

A common factor isn't well identified with only two indicators. You'd need to impose a constraint (e.g. equal factor loadings). This and related issues are discussed nicely in Lohelin's Cholesky Note: https://doi.org/10.1007/BF02361160 . See, in particular, Fig 3 and associated discussion.

Additionally, you are attempting to decompose 2 SNP-trait effects into 3 SNP-trait effects, which you don't have the df for. For k GWAS traits, your model can only estimate k or fewer SNP effects per SNP, unless you are imposing some sort of additional constraints. So, if you have a SNP effect on the common factor, you can only estimate k-1 independent pathways from the SNPs to the individual indicators. See our discussion of the common and independent pathways model in the original Grotinger et al. paper in Nature Human Behaviour. Also see the de la Fuente et al. Nature Human Behaviour paper's supplement on ‘Interpreting the heterogeneity statistic.' Some of the original work on this sort of model was described a while ago but Muthen: https://doi.org/10.1007/BF02296397

All the best,

Elliot
--
Elliot M. Tucker-Drob, Ph.D.
Professor
Department of Psychology
Faculty Research Associate

Population Research Center
The University of Texas at Austin
108 E. Dean Keeton Stop A8000
Austin, TX 78712-0187
tucke...@utexas.edu
www.lifespanlab.com

--
You received this message because you are subscribed to the Google Groups "Genomic SEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genomic-sem-us...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/genomic-sem-users/82f2c7f4-89a8-42ee-8a8b-817f551ee56dn%40googlegroups.com.

Sam Kleeman

unread,

Jan 20, 2021, 8:32:53 PM1/20/21

to Elliot Tucker-Drob, Genomic SEM Users

Dear Elliot,

Many thanks for your detailed response, that is very helpful. We have reviewed the work you suggested and changed our plans accordingly (to estimate only two factors from the available traits). However, I wanted to ask about expected runtimes: I have been using the script below which should fit the model to the first 1000 SNPs, this has been running on a 48 core machine for more than 8 hours so far today with no output apart from 'This is lavaan 0.6-7'. I have copied the script below, which is based on the R tutorial for GWAS by subtraction.

Kind regards,

Sam

load(file="Sumstats.RData")

load(file="LDSCoutput_cy_cr.RData")

LDSCoutput$I[1] = 1

LDSCoutput$I[4] = 1

sub = p_sumstats[1:1000,]

# model with SNP

model<-'C=~NA*TRAIT1 + start(0.2)*TRAIT1

NC=~NA*TRAIT1 +start(0.4)*TRAIT1 +start(0.4)*TRAIT2

C~SNP

NC~SNP

C~~1*C

NC~~1*NC

C~~ 0*NC

TRAIT1~~0*TRAIT1

TRAIT2~~0*TRAIT2

TRAIT1~~0*TRAI2

SNP~~SNP'

#Run the Genomic SEM GWAS

outputGWAS<-userGWAS(covstruc=LDSCoutput,SNPs=sub,estimation="DWLS",model=model, cores=48, parallel=TRUE, GC='none')

Michel Nivard

unread,

Jan 21, 2021, 2:30:21 AM1/21/21

to Genomic SEM Users

Can you try a run on a single core? this is unexpected behaviour, sure gemomicSEM can take a while to run (it fits a model per SNP after all) but more than a second per SNP (or may indicate issues. One issue is that parallel may create a copy of the entire workspace for each of 48cores, and if your workspace is alrge (i.e.c contains a lot of data) this would exceed RAM.

If t is the case that the GWAS runs fine on a single core you can split the data per chromosome, and stat 22 R processes next to each other (each with their own workspace containing only 1 chromosome of data), or run parallel with less cores, both options may be faster. If however the process still is slow on a single core come back tu us and we can take another look.

Best,

Michel

Sam Kleeman

unread,

Jan 21, 2021, 10:33:56 AM1/21/21

to Michel Nivard, Genomic SEM Users

Hi Michel,

Thanks for your email. Running with parallel=FALSE fixes the problem. Is there any reason why running with parallel=TRUE stops it from running?

Best wishes,

Sam

You received this message because you are subscribed to a topic in the Google Groups "Genomic SEM Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/genomic-sem-users/s2UDKbVzakg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to genomic-sem-us...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/genomic-sem-users/db75619f-63bf-41df-a3e8-5b6c9eedacb3n%40googlegroups.com.

Michel Nivard

unread,

Jan 21, 2021, 11:43:31 AM1/21/21

to Genomic SEM Users

Could be a few things what happens if you try like 2 cores? the issue could be RAM related so running 48 R instances, each with their own copy of the whole workspace may exceed the RAM you have available? I assume this is a linux system right?

Best,

Michel

Sam Kleeman

unread,

Jan 21, 2021, 11:56:08 AM1/21/21

to Michel Nivard, Genomic SEM Users

2 cores results in the same issue, we are running on Centos linux.

Sam

To view this discussion on the web visit https://groups.google.com/d/msgid/genomic-sem-users/b39d6781-3972-44f4-9991-52d07ffec06fn%40googlegroups.com.

Sam Kleeman

unread,

Jan 26, 2021, 10:53:04 PM1/26/21

to Michel Nivard, Genomic SEM Users, Meyer, Hannah

Hi Michel,

I was hoping to ask about the origin of the 'reference.1000G.maf.0.005.txt’ file provided with the package? Is this from all 1000G samples or just European? We are doing a multi-ancestry GWAS and I was hoping to use a reference file with the MAF from each population. I am using population specific UK biobank data (from https://pan.ukbb.broadinstitute.org) but I am finding odd behavior for rsIDs with more than one SNP - for example, rs74640353, when run through the sumstats function I get a strange output where this SNP gets repeated multiple times with the SE/beta number interleaved, example copied below. From some exemplar searches in the 'reference.1000G.maf.0.005.txt’ file, it looks like this type of SNP has been removed? These SNPs are not duplicated in the primary data, examples copied below. The output for this SNP should just be two rows, one for each allelic combination (C/A and C/T).

Really grateful for your help with this!

Best wishes,

Sam

[skleeman@bamdev2 ~]$ grep 89265101 /grid/wsbs/home_norepl/skleeman/reference.1000G.maf.0.005.txt
[skleeman@bamdev2 ~]$ 

[skleeman@bamdev2 EUR]$ grep 89265101 GWAS_summary.tsv

16:89265101	rs74640353	C	A	9.5953e-01	-7.2195e-01	6.9540e-02	9.8633e-01	431764	5.7000e-24

16:89265101	rs74640353	C	T	6.5306e-01	-9.9393e-02	2.8678e-02	1.0000e+00	431764	1.1000e-04

[skleeman@bamdev2 EUR]$ grep 89265101 /mnt/grid/janowitz/home/references/maf/panukb_snps0.005_EUR.tsv
rs74640353	16	89265101	4.0150e-02	C	A
rs74640353	16	89265101	3.4675e-01	C	T


SNPCHRBPMAFA1A2beta.CPse.CPbeta.EAse.EA
<chr><chr><int><dbl><chr><chr><dbl><dbl><dbl><dbl>
7276317rs7464035316892651010.04015CA-0.0553485920.005481726-0.0276634770.005481726
7276318rs7464035316892651010.04015CA-0.0553485920.005481726-0.0071781790.002261069
7276319rs7464035316892651010.04015CA-0.0087444680.002261069-0.0276634770.005481726
7276320rs7464035316892651010.04015CA-0.0087444680.002261069-0.0071781790.002261069
7276321rs7464035316892651010.34675CT-0.0553485920.005481726-0.0276634770.005481726
7276322rs7464035316892651010.34675CT-0.0553485920.005481726-0.0071781790.002261069
7276323rs7464035316892651010.34675CT-0.0087444680.002261069-0.0276634770.005481726
7276324rs7464035316892651010.34675CT-0.0087444680.002261069-0.0071781790.002261069

SNP	CHR	BP	MAF	A1	A2	beta.CP	se.CP	beta.EA	se.EA
7276317	rs74640353	16	89265101	0.04015	C	A	-0.055348592	0.005481726	-0.027663477	0.005481726
7276318	rs74640353	16	89265101	0.04015	C	A	-0.055348592	0.005481726	-0.007178179	0.002261069
7276319	rs74640353	16	89265101	0.04015	C	A	-0.008744468	0.002261069	-0.027663477	0.005481726
7276320	rs74640353	16	89265101	0.04015	C	A	-0.008744468	0.002261069	-0.007178179	0.002261069
7276321	rs74640353	16	89265101	0.34675	C	T	-0.055348592	0.005481726	-0.027663477	0.005481726
7276322	rs74640353	16	89265101	0.34675	C	T	-0.055348592	0.005481726	-0.007178179	0.002261069
7276323	rs74640353	16	89265101	0.34675	C	T	-0.008744468	0.002261069	-0.027663477	0.005481726
7276324	rs74640353	16	89265101	0.34675	C	T	-0.008744468	0.002261069	-0.007178179	0.002261069

On Jan 21, 2021, at 11:43, Michel Nivard <m.g.n...@gmail.com> wrote:

To view this discussion on the web visit https://groups.google.com/d/msgid/genomic-sem-users/b39d6781-3972-44f4-9991-52d07ffec06fn%40googlegroups.com.

Elliot Tucker-Drob

unread,

Jan 27, 2021, 9:29:51 AM1/27/21

to Sam Kleeman, Michel Nivard, Genomic SEM Users, Meyer, Hannah

Hi Sam,

You should not use Genomic SEM for a multi-ancestry analysis without substantial adaptation of the software and pipeline. This is something that we will be working on, but it will require a good deal of methodological innovation and won't be ready in the near term. This is because the LDSC portion of the S matrix that is modelled as part of userGWAS must take a single set of ancestry-specific LD scores. Versions of LDSC that take multiple ld reference files are still in development. You can of course run the entire Genomic SEM pipeline as-is with summary statistics that are all from a single ancestry group that is not European; all you would need to do is supply appropriate ld scores and reference sample MAFs for the ancestry group that you are analyzing. If you would like to analyze data from multiple ancestry groups, you should analyze each group one at a time (This is put nicely at the beginning of this wiki page: https://github.com/bulik/ldsc/wiki/What-Data-Are-Necessary-to-Estimate-Genetic-Correlation%3F). The reference files that are we currently link to are for European ancestry, and the Pan-UK resource that you link to provides reference files for other ancestry groups.

Elliot

--

Elliot M. Tucker-Drob, Ph.D.
Professor

Department of Psychology
Faculty Research Associate
Population Research Center
The University of Texas at Austin
108 E. Dean Keeton Stop A8000
Austin, TX 78712-0187
tucke...@utexas.edu
www.lifespanlab.com

To view this discussion on the web visit https://groups.google.com/d/msgid/genomic-sem-users/2DEE98C5-C308-430F-9C88-031DBB194D3B%40gmail.com.

Sam Kleeman

unread,

Jan 27, 2021, 9:34:16 AM1/27/21

to Elliot Tucker-Drob, Michel Nivard, Genomic SEM Users, Meyer, Hannah

Dear Elliot,

Sorry I did not mean to suggest we are including multiple ancestry groups in the same summary statistics. We have generated separate statistics for three ancestry groups (EUR, CSA, AFR) and we have separate LD statistics for each population derived from the Pan UK Biobank project, and these are used for the LD score regression. As a result, we would like to use separate allele frequency reference files for each population so that we include SNPs with MAF >0.005 in each ancestry group.

I have looked again at the reference file and there are no duplicate rsIDs present, is that because these have been removed as part of generating this file? I am experiencing very odd behavior with duplicate rsIDs using the sumstats function.

Kind regards,

Sam

Michel Nivard

unread,

Jan 27, 2021, 2:30:27 PM1/27/21

to Sam Kleeman, Elliot Tucker-Drob, Genomic SEM Users, Meyer, Hannah

Hi Sam,

I recommend you make a specific ref file for this analysis, no need to restrict to 1000GSNPs as is the case when you use public sumstats that are poorly documented or don’t have maf info. Here you have that info so go ahead and make your own ref file if you can.

Re the issue with multiple SNPs on one position, you may have caught a bug, I’ll look into it ASAP.

Thanks for reporting!

Michel

Op wo 27 jan. 2021 om 15:34 schreef Sam Kleeman <samkl...@gmail.com>

Sam Kleeman

unread,

Jan 27, 2021, 2:42:42 PM1/27/21

to Michel Nivard, Elliot Tucker-Drob, Genomic SEM Users, Meyer, Hannah

Dear Michel,

Really grateful for your help with this.

I suspect the problem may relate to the following line in sumstats.R, I wonder if the merge should be by=c(“SNP”, “A1”,”A2”)?

data.frame.out <- suppressWarnings(inner_join(data.frame.out,output,by="SNP",all.x=F,all.y=F))

Kind regards,

Sam

Michel Nivard

unread,

Jan 27, 2021, 3:06:21 PM1/27/21

to Genomic SEM Users

Ok, thanks for the possible fix, I quickly discussed this, and we see some issues, for one its hard to determine what the variance of a multi-alleic variant is. This in turn will cause issues down stream when defining the variance of a SNP in the covariance matrix.

Will require more thought on our end, perhaps we'll have to formally split the variant or skip them entirely.

Anyway, let me know if you have any further input/suggestions on this issue for us.

SNP	CHR	BP	MAF	A1	A2	beta.CP	se.CP	beta.EA	se.EA
	<chr>	<chr>	<int>	<dbl>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
7276317	rs74640353	16	89265101	0.04015	C	A	-0.055348592	0.005481726	-0.027663477	0.005481726
7276318	rs74640353	16	89265101	0.04015	C	A	-0.055348592	0.005481726	-0.007178179	0.002261069
7276319	rs74640353	16	89265101	0.04015	C	A	-0.008744468	0.002261069	-0.027663477	0.005481726
7276320	rs74640353	16	89265101	0.04015	C	A	-0.008744468	0.002261069	-0.007178179	0.002261069
7276321	rs74640353	16	89265101	0.34675	C	T	-0.055348592	0.005481726	-0.027663477	0.005481726
7276322	rs74640353	16	89265101	0.34675	C	T	-0.055348592	0.005481726	-0.007178179	0.002261069
7276323	rs74640353	16	89265101	0.34675	C	T	-0.008744468	0.002261069	-0.027663477	0.005481726
7276324	rs74640353	16	89265101	0.34675	C	T	-0.008744468	0.002261069	-0.007178179	0.002261069