Input filtering based on imputation score (R2)

4,394 views
Skip to first unread message

O Kam

unread,
Jan 25, 2018, 11:08:47 AM1/25/18
to plink2-users
Hi!

I was wondering, whether Plink 2 is capable of filtering SNPs from vcf-files based on imputation score (R2)? Technically documentation says in the Input filtering-section that for example --exclude-if-info [key] [operator] [value] should do this . However at the same time  documentation of the vcf.-file input says that most of the unnecessary information (like quality scores) is discarded, though it doesn´t explicitely say that info-field where R2 is stored, cannot be used. So the question is - can the  --exclude-if-info flag (or some other flag) be used to filter out the variants based on R2?

best regards, 

- O

Christopher Chang

unread,
Jan 25, 2018, 1:37:34 PM1/25/18
to plink2-users
Plink 2.0 keeps everything in the INFO column (as long as you use --pfile/--make-pgen instead of --bfile/--make-bed), so you could use --exclude-if-info for this purpose.  (There is also --mach-r2-filter, which computes MaCH's flavor of imputation-r2 from scratch and filters on that.)

The main thing Plink 2.0 still doesn't keep is extra FORMAT fields (read depth, genotype quality, etc.) where every sample x variant combination has its own value.

O Kam

unread,
Jan 25, 2018, 5:22:54 PM1/25/18
to plink2-users
Ok, so technically, /plink2 --vcf [vcffilename] dosage=DS --exclude-if-info r2=<3 --score [scorefilename] should work? Or how one could refer to R2 from info field in --exclude-if-info flag?

- O

Christopher Chang

unread,
Jan 25, 2018, 5:28:49 PM1/25/18
to plink2-users
A few notes:
* This is case-sensitive.  If the key is 'R2' in the INFO column, you need to capitalize the R in your comparison expression.
* You need to put double-quotes around the comparison expression, since '<' is a special character in bash.
* The standard way of typing less-than-or-equal-to is '<=', and plink2 does not understand '=<' for now.  (I may change this if '=<' is more common than I was previously aware of, though.)

O Kam

unread,
Jan 25, 2018, 6:01:47 PM1/25/18
to plink2-users
Thank you! This was really informative! so... 
- this should work (assuming capitalized R):  /plink2 --vcf [vcffilename] dosage=DS --exclude-if-info "R2<=3" --score [scorefilename] 
- and fields from INFO column can be addressed by their names 

Christopher Chang

unread,
Jan 25, 2018, 6:36:55 PM1/25/18
to plink2-users
Yes, that looks correct.

(Incidentally, if you're feeding the same VCF to plink2 multiple times, you may want to use --keep-autoconv on the first run, and then use --pfile to load the auto-converted data in later runs.  This way, you don't have to repeat the relatively slow VCF-parsing step.)

alexi...@gmail.com

unread,
Feb 5, 2018, 7:54:07 PM2/5/18
to plink2-users
Hi Chirs,

But --make-pgen convert the dosage vcf file to the binary files with "best guess" genotypes, right? What if I want to keep the dosage info so that I can use dosage for regression?

Regards,
Alexis

在 2018年1月26日星期五 UTC+11上午10:36:55,Christopher Chang写道:

Christopher Chang

unread,
Feb 5, 2018, 8:01:39 PM2/5/18
to plink2-users
.pgen files can contain "best guess" genotypes and dosages *simultaneously*.  When they do, the dosages will be used during regression.

alexi...@gmail.com

unread,
Feb 5, 2018, 8:41:23 PM2/5/18
to plink2-users
Many thanks for the prompt reply! So this means that after imputation with Minimac3, I don't need to use dosageconvertor to convert vcf to plink.dosage files, and no need to use "--import-dosage" to convert the dosage files to pfiles? Just use "--vcf [vcffilename] dosage=DS --exclude-if-info "R2<=0.9" --make-pgen --out [pfilename]" to convert vcf file to pfile with dosage info, then use ""--pfile [pfilename] --maf 0.01 --hwe 0.000001  --make-pgen [pfilename2]" to QC and "--pfile [pfilename2] --covar --covar-number [number] --pheno [phenofilename] --ci 0.95 --out [regressionfilename]" for regression (which will automatically recognizes whether it is case/control or qtl) and report additive effect? 
Ideally I would like to perform both R2 and MAF filtering using --exclude-if-info in the first step when converting vcf to pfiles, but it seems Iike I cannot use same flag twice -- "Error: Duplicate --exclude-if-info flag".

Regards,
Alexis

在 2018年2月6日星期二 UTC+11下午12:01:39,Christopher Chang写道:

alexi...@gmail.com

unread,
Feb 5, 2018, 9:13:21 PM2/5/18
to plink2-users
Also can dosage data perform --maf and --hwe check?

在 2018年2月6日星期二 UTC+11下午12:41:23,alexi...@gmail.com写道:

alexi...@gmail.com

unread,
Feb 5, 2018, 9:43:33 PM2/5/18
to plink2-users
And is --glm in plink2 equal to --linear/--logistic in plink 1.x?

在 2018年2月6日星期二 UTC+11下午1:13:21,alexi...@gmail.com写道:

Christopher Chang

unread,
Feb 6, 2018, 1:27:37 PM2/6/18
to plink2-users
* For now, just include one --exclude-if-info filter in your initial VCF -> pgen conversion.  Once you have the data in pgen format, additional --exclude-if-info filters will run very quickly.
* Dosage data is used by the --maf filter, and ignored by --hwe (since the latter depends on genotype counts).
* Yes, plink2 --glm is equivalent to plink 1.x --linear/--logistic.

alexi...@gmail.com

unread,
Feb 6, 2018, 9:04:26 PM2/6/18
to plink2-users
Thanks for the reply. However I did get the log file indicating hwe check had been done, does it mean that my pfiles contain no dosage info? But the pfiles were converted from VCF with dosage=DS flag being used.


Below is the log file:
-------------------------------------------------------------------------------------------------
PLINK v2.00a1LM 64-bit Intel (3 Feb 2018)
Options in effect:
  --hwe 0.000001
  --maf 0.01
  --make-pgen
  --out chr20_maf0.01_hwe
  --pgen chr20.pgen
  --psam psam_for_SCZ_dosage.psam
  --pvar chr20.pvar

Hostname: hpc018
Working directory: /project/dosage
Start time: Tue Feb  6 17:39:15 2018

Random number seed: 1517899155
128947 MB RAM detected; reserving 64473 MB for main workspace.
Using up to 24 threads (change this with --threads).
746 samples (307 females, 439 males; 746 founders) loaded from
psam_for_SCZ_dosage.psam.
345928 variants loaded from chr20.pvar.
1 binary phenotype loaded (466 cases, 280 controls).
Calculating allele frequencies... done.
--hwe: 120 variants removed due to Hardy-Weinberg exact test (founders only).
144296 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
201512 variants remaining after main filters.
Writing chr20_maf0.01_hwe.pgen ... done.
Writing chr20_maf0.01_hwe.pvar ... done.
Writing chr20_maf0.01_hwe.psam ... done.

End time: Tue Feb  6 17:39:16 2018
---------------------------------------------------------------------------------------------
Thanks


alexi...@gmail.com

unread,
Feb 6, 2018, 9:32:33 PM2/6/18
to plink2-users
Also for --covar-col-nums [covarnumber], the number of the covarites should start from the third column (if the first two columns are FID and IID), i.e. --covar-col-nums 3, while in plink 1.x --covar-number [covarnumber], the number of the covariate starts from the column of the first covariate (ignore the FID/IID column), i.e. --covar-number 1, right? 

在 2018年2月7日星期三 UTC+11下午1:04:26,alexi...@gmail.com写道:

Christopher Chang

unread,
Feb 7, 2018, 12:27:01 PM2/7/18
to plink2-users
* When you import with dosage=DS, the .pgen file will usually contain dosages *and* best-guess genotypes.  --hwe uses the best-guess genotypes.
* You're correct about --covar-col-nums.  (This change is appropriate since FID will soon be made optional in practically all input files.)

O Kam

unread,
Apr 14, 2018, 12:36:15 PM4/14/18
to plink2-users
Dear Christopher, 

As you mentioned, plink2 --glm is equivalent to --linear/--logistic in plink 1.x. Does it, however, utilize the dosage data for imputed SNPs? And if yes, does it use hardcalls/best-guess genotypes or utilize the actual dosage/probability?

best regards,

- O

Christopher Chang

unread,
Apr 14, 2018, 12:50:49 PM4/14/18
to plink2-users
Yes, --glm uses actual dosage/probability when it's in the input file.  You just need to make sure they're imported properly, e.g. if you want to use the dosages in a VCF DS field, it's necessary to append 'dosage=DS' to your --vcf flag.

O Kam

unread,
Apr 14, 2018, 1:16:53 PM4/14/18
to plink2-users
That´s awesome! Thank you!

A follow-up question: 
I tried to run it (./plink2 --vcf [vcffilename] dosage=DS --pheno [phenofilename] --glm), but I got an error "Error: No entries in pheno.txt correspond to loaded sample IDs). 
The pheno-file presumably was as it was  supposed to be - tab-delimited, header starting with FID and IID fields, no spaces in any fields  etc (I also tried to do it with another file which was working earlier with plink 1.07 but got the same error)

And also checked manually and there were same IDs present in both vcf- and pheno-files. 

Is there any bugs in reading IDs from vcf-files or can I somehow check or debug the vcf-file entries?

- O

Christopher Chang

unread,
Apr 14, 2018, 1:26:41 PM4/14/18
to plink2-users
Default VCF ID import behavior was changed to "--const-fid 0" in mid-February, and this compatibility-breaking change distinguishes "alpha 2" from "alpha 1" builds.

If you have meaningful FID fields, and the IDs in your VCF file are of the form [FID][delimiter][IID], you now need to explicitly specify --id-delim.
If you do not, you should only have one ID column in pheno.txt, and its header line should start with #IID.

O Kam

unread,
Apr 14, 2018, 2:52:07 PM4/14/18
to plink2-users
Hi!

Thank you! This helped: removing FID field and updating to "alpha 2"-build worked and analysis ran smoothly!

Now, the behavior of --linear flag in plink 1.07 could be modified with --dominant or --recessive flags. Can the behavior of --glm flag in plink 2.0 be modified in similar manner?

Also, do the sex filters work with vcf-files? I have specified the gender in pheno-file, but apparently plink2 tries to load them from temporary pgen/pvar/psam files and fails to do it resulting in "0 females, 0 males, 10000 ambiguous". Can I somehow ask plink to read sex from phenotype file or should the information be extracted from vcf?

- O

Christopher Chang

unread,
Apr 14, 2018, 4:05:13 PM4/14/18
to plink2-users
1. "--glm dominant" and "--glm recessive" will do what you want.
2. --update-sex is the simplest workaround for now.  (Alternatively, if you use --vcf and --psam simultaneously, plink2 should verify the IDs are consistent with each other and load the pedigree/phenotype information in the .psam file, but this is bugged in the currently-posted build; a fix for that will be posted later today.)

O Kam

unread,
Apr 14, 2018, 4:17:35 PM4/14/18
to plink2-users
Great, so it is all working now in the new build!

For gender-specific analysis also separate files with male / female id-s worked (--keep females.txt). 

Last question: is the --qt-means flag supported by plink2?

- O

Christopher Chang

unread,
Apr 14, 2018, 4:20:27 PM4/14/18
to plink2-users
No, --qt-means isn't implemented yet, and isn't a high priority since it can't take advantage of dosage data.  Just use --make-bed followed by plink 1.x's qt-means.

O Kam

unread,
Apr 29, 2018, 5:53:11 AM4/29/18
to plink2-users
Ok, thank you! 

I tried that  with "--vcf filename --dosage=DS --make-bed" followed by "--bfile filename --pheno filename --assoc --qt-means" in plink 1.07. 
For some reason plink1.07 reports an error in reading BIM file at very end of the file on the line with X-chromosome SNP (ERROR: Problem reading BIM file, line 100). 
The same error occurs also with basic --freq command, so I think there is something wrong with my BIM but I don´t know what. 
Is there some issues associated with X-chromosome variant coding or use of --make-bed command with .vcf dosage files that I should be aware of?

- O

Christopher Chang

unread,
Apr 29, 2018, 10:48:30 AM4/29/18
to plink2-users
plink 1.07 has less flexible chromosome code parsing than 1.9 and 2.0. Your problem should go away with 1.9.

O Kam

unread,
Jun 7, 2018, 11:46:34 AM6/7/18
to plink2-users
Hi!

That worked fine! By the way, how Plink 1.9 arrives at genotype group (aa/aA/AA) averages (--qt-means) when starting initially with allelic dosages? E.g. where does it get genotype probabilities from, if initially dosage information was given as an input? I tried to get probabilities from dosages for some other analysis but it seemed quite tricky..

best regards, 

O

Christopher Chang

unread,
Jun 8, 2018, 12:06:33 PM6/8/18
to plink2-users
Most of plink 1.9 does not support allelic dosages at all.  If you give it a VCF file with e.g. a DS field, plink 1.9 always reads the GT field and ignores the dosages.  If you give it --gen/--bgen genotype probabilities, it normally saves the most likely genotype when uncertainty is less than 10%, and saves a missing call otherwise.  You need to use plink 2.0 if you want to work with genotype dosages.

--qt-means, of course, is a method that simply doesn't apply to dosages at all; and it's not alone.  Plink 2.0 handles this by saving a genotype hardcall entry along with each dosage; dosages in [0, 0.1] normally have a hardcall of 0, [0.9, 1.1] -> hardcall 1, [1.9, 2.0] -> hardcall 2, and everything else is assigned a missing value in this field.  You can change this with the --hard-call-threshold flag.
Reply all
Reply to author
Forward
0 new messages