haplotype association, again

195 views
Skip to first unread message

Stuart Willis

unread,
Jul 26, 2019, 2:26:06 PM7/26/19
to plink2-users
What is the latest on how to test haplotype (or phased multi-locus=multi-allelic) association?

I have inferred haplotypes across some markers of interest (in high LD) with ShapeIt2 and wish to test the association of these with my quantitative phenotype rather than the individual SNPs.

What is the best way to do this? Can plink2 handle a vcf if I recode the SNPs as a single locus with multiple alleles?

I know plink1.9 used to be able to handle haplotype association, though it was abandoned because it required inferring the haplotypes too, which was error prone (or so is my understanding).

What's the best solution here? Thanks a lot!

Stuart

Christopher Chang

unread,
Jul 26, 2019, 6:19:28 PM7/26/19
to plink2-users
Single locus with multiple alleles works for now, yes, though plink2 is currently limited to 254 ALT alleles per variant.

Stuart Willis

unread,
Jul 27, 2019, 3:19:35 PM7/27/19
to plink2-users
Great! Is that VCF coding with all alleles on one line or one line per alternate allele? I tried the former and PLINK2 reported a "GT" error, though VCFtools read the file just fine.

Is there a way to force the haps/sample format to be read as a single locus to test against multiple alleles?

Thanks!

Christopher Chang

unread,
Jul 27, 2019, 3:22:20 PM7/27/19
to plink2-users
The ALT alleles must all be on the same line. Can you post a test vcf that triggers the error you mention?

Stuart Willis

unread,
Jul 29, 2019, 1:11:55 PM7/29/19
to plink2-users
Hi Christopher,

So there's currently no way to force the haps/sample file to be interpreted this way?

I figured out that GT error was being caused by my alleles still being numbered 1 thru N instead of 0 thru N-1 when I created the VCF in R. It also took some experimenting but I got the pheno and covar data to read in when I omitted FID and included a header starting with #IID.

However, once I got the files in and ran --linear, the results are all NA for both each ALT alleleXphenotype as well as the covariates. Should this suggest my data being read incorrectly? Here's my command line:

plink2 --vcf allBonn2018_pheno.phased.vcf --out allBonn2018_pheno --pheno allBonn2018_pheno.pheno --covar allBonn2018_pheno.cv --linear

Also, what's the best way to read in sex information with VCF as the source file? I coded it as 1,-1 as a co-variate.

Thanks!

Stuart
allBonn2018_pheno.pheno
allBonn2018_pheno.cv
allBonn2018_pheno.phased.vcf

Christopher Chang

unread,
Jul 29, 2019, 8:25:10 PM7/29/19
to plink2-users
1. plink2 doesn't have built-in "variant join" (combining multiple biallelic variants in e.g. haps/sample into a single multiallelic variant when position/REF are identical) yet; you need to use e.g. bcftools to do this for now.
2. --linear is failing for two reasons:
  a. Some of your ALT alleles (e.g. #26) don't appear in your dataset at all, after filtering out samples with missing phenotype/covariate values.
  b. There are also three linear dependencies that must be removed to get a regression result: ALT #25 and ALT #30 always appear together, the sum of the ALT #11 and #31 columns is equal to the ALT #10 column, and the sum of the ALT #27 and #32 columns is equal to the ALT #21 column.  This sort of thing is inevitable when you include a bunch of haplotypes that only appear once.
3. --update-sex, with male=1 and female=2 encoding, is plink's standard way of filling in sex information.

Stuart Willis

unread,
Jul 30, 2019, 10:22:27 AM7/30/19
to plink2-users
Thanks for your response. Indeed this set was pared down from a slightly larger genotype file, so that should be easy enough to filter.

Regarding your second point, your advice then is to filter low frequency haplotypes before testing?

Christopher Chang

unread,
Jul 30, 2019, 12:10:40 PM7/30/19
to plink2-users
Yes; at minimum, I'd replace singleton haplotypes with missing genotype values before performing the regression.  (This doesn't fix everything--in your example VCF, ALT #10 and #11 would collide after this filtering step, so you'd need to tweak their representation.  I'm planning to add an optional "error" column to --glm's output which would tell you that ALT #10 is a linear combination of the other predictors in this case.)

Stuart Willis

unread,
Jul 30, 2019, 1:42:57 PM7/30/19
to plink2-users
I'm guessing I could either use a MAF high enough to get rid of 10 and/or 11, or just test them iteratively with one or the other coded as missing. I take it PLINK2 will handle vcf genotypes with half-missing heterozygotes? Thanks!

Christopher Chang

unread,
Jul 30, 2019, 1:50:36 PM7/30/19
to plink2-users
If you're including half-missing genotypes in your VCF, you'll need to use --vcf-half-call to control how plink2 interprets them.  (plink2 does not support half-missing heterozygotes; I'd just replace them with fully missing genotypes.)

You have enough 10/11 genotypes that I'd try not to use a MAF filter here.  (I'd want to replace them with 10/10 for haplotype testing purposes, though I realize it'll take some work to generalize this approach across an entire genome.)
Reply all
Reply to author
Forward
0 new messages