SNP and LD reference

Junjie Lu

Oct 14, 2020, 3:27:09 PM

to plink2-users

Dear Chris:

I am asking for a friend who currently cannot access google groups. Any information would be appreciated. Thanks in advance.

I found that PLINK simply ignores SNPs that are not in the LD reference bfile, even if those SNPs are significant in the GWAS file.

So, for a GWAS with 10 SNPs whose P<5E-08, if none of these 10 SNPs exist in the LD reference bfile, then plink --bfile my-ref --clump my-GWAS --clump-p1 5e-8 does not output anything. I feel this is not the ideal behavior. These 10 SNPs are significant and should be output by --clump, whether or not they exist in the LD reference file.

It would be very nice to have the --clump option in plink2. Right now, it is only available in plink1.9. Also, I found that the nice option “--update-chr” is only available in plink1.9, but not in plink2. This option is very useful when I need to strip or add the “chr” prefix for “CHROM” in a VCF file.

Thank you & best regards,

Christopher Chang

Oct 14, 2020, 3:51:23 PM

to plink2-users

1. This is essentially a feature, not a bug. The point of --clump is to clump related-due-to-LD association results. Using a separate LD panel, instead of the dataset the association analysis was performed on, is actively counterproductive.

2. Use --output-chr to control whether the "chr" prefix is present in output files.

Junjie Lu

Oct 15, 2020, 8:00:51 PM

to plink2-users

Dear Chris:

Just to relay a follow-up question. Thank you for your quick response.

Thank you very much!

These days, GWAS summary statistics files are very easy to get, but the raw genetic "dataset the association analysis was performed on" is not.

For example, we can download UK Biobank based GWAS results from public websites, but not the UKB genetic data.

Also, UKB has ~500,000 samples and ~9,600,000 SNPs, so I guess it is not suitable for use as an LD reference file.

So, I still use HapMap3 as the LD reference panel. Many software packages, including LDSC and GCTA, also use HapMap data as a reference.

So, it would be really ideal for PLINK to keep significant GWAS SNPs in the output even when they are not in the LD reference file.

And it would be really good to implement --clump in plink2, so that we could use plink2 --pfile --clump.

BTW, I am going to run plink2 --glm on UKB imputed chrX data. Do I need to add any extra options for chrX data, such as "--chr X" and "no-x-sex"?

Or, would you recommend me to run --glm on males and females separately and then meta-analyze these two?

Best regards,

Christopher Chang

Oct 16, 2020, 11:52:52 AM

to plink2-users

1. It's easy enough to e.g. write a short Python script to postprocess --clump results in the manner described here.

2. Please read the --glm documentation; there's a section on chrX.
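
A minimal sketch of the kind of postprocessing script mentioned in point 1, not taken from the thread. It lists the GWAS SNPs that pass the p-value threshold but are absent from the LD reference, so they can be appended to the --clump output by hand. The function names are hypothetical. The sketch assumes a whitespace-delimited GWAS file with a header containing "SNP" and "P" columns, and a reference .bim file whose second column is the variant ID; adjust the column names if your files differ.

```python
def read_bim_ids(bim_lines):
    """Collect variant IDs (second column) from the lines of a .bim file."""
    return {line.split()[1] for line in bim_lines if line.strip()}


def significant_missing(gwas_lines, ref_ids, p1=5e-8, snp_col="SNP", p_col="P"):
    """Return (snp, p) pairs with p <= p1 whose ID is not in the LD reference.

    Assumption: the first line is a header naming the SNP and P columns.
    """
    lines = iter(gwas_lines)
    header = next(lines).split()
    snp_i, p_i = header.index(snp_col), header.index(p_col)
    missing = []
    for line in lines:
        fields = line.split()
        if len(fields) <= max(snp_i, p_i):
            continue  # skip blank or truncated lines
        try:
            p = float(fields[p_i])
        except ValueError:
            continue  # skip non-numeric p-values such as "NA"
        if p <= p1 and fields[snp_i] not in ref_ids:
            missing.append((fields[snp_i], p))
    return missing


# Example usage with real files (file names are placeholders):
# with open("my-ref.bim") as bim, open("my-GWAS") as gwas:
#     extra = significant_missing(gwas, read_bim_ids(bim))
# for snp, p in extra:
#     print(snp, p)
```

Each SNP returned this way could be reported as its own single-variant "clump", since no LD information is available for it in the reference panel.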

Junjie Lu

Oct 16, 2020, 11:13:28 PM

to plink2-users

Dear Chris:

Thank you very much!

I read the --glm documentation on chrX (https://www.cog-genomics.org/plink/2.0/assoc#glm).

It says that “First, sex is normally included as an additional covariate. If you don't want this, add the 'no-x-sex' modifier”. I usually create separate phenotype files for males and females, apply an inverse normal transformation, and run the GWAS. In this case, there is only one sex in my phenotype file. PLINK --glm still runs successfully without the “no-x-sex” modifier. I guess the result would be the same if I used the “no-x-sex” modifier here. Nevertheless, it seems that I should always use the “no-x-sex” modifier when I have created separate phenotype files for males and females, correct? For chrX analysis, I should always include the “--xchr-model 2” option, correct?

I also read the documentation right below the chrX section. It says that “this is a change from PLINK 1.x; the old --all-pheno flag is now effectively always on. If you have multiple quantitative phenotypes with either no missing values, or missing values for the same samples, analyze them all in a single --glm run!”. If I have a phenotype file with 10 traits, and each trait has some missing data (NA), should I simply use “--pheno MY-PHENO.txt” (without specifying “--pheno-name”) to invoke the default --all-pheno behavior? Or should I run each phenotype separately, since each of my 10 traits has some missing data and the missing values fall on different samples, so this does not meet your criterion of “if you have multiple quantitative phenotypes with either no missing values, or missing values for the same samples”?
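
To check which traits actually satisfy the quoted condition, one option is to group the phenotype columns by their missingness pattern. The sketch below is an illustration only, not something from the PLINK documentation. It assumes a whitespace-delimited phenotype file with a header line, two leading ID columns (FID and IID), and "NA" as the missing-value code; the function name is hypothetical.

```python
from collections import defaultdict


def group_by_missingness(pheno_lines, id_cols=2, missing_tokens=("NA",)):
    """Group phenotype names by the set of samples (row indices) that are missing.

    Traits in the same group have missing values for exactly the same samples
    (an empty set means no missing values at all).
    """
    rows = [line.split() for line in pheno_lines if line.strip()]
    header, data = rows[0], rows[1:]
    groups = defaultdict(list)
    for j in range(id_cols, len(header)):
        pattern = frozenset(
            i for i, row in enumerate(data) if row[j] in missing_tokens
        )
        groups[pattern].append(header[j])
    return list(groups.values())


# Example usage (file name is a placeholder):
# with open("MY-PHENO.txt") as f:
#     for traits in group_by_missingness(f):
#         print(traits)
```

Traits that land in the same group share one missingness pattern, which is the situation the quoted passage describes; traits in different groups do not.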