missing values in imputed ukb data, plink2

587 views
Skip to first unread message

david anthony

unread,
Nov 7, 2020, 8:19:22 PM11/7/20
to plink2-users
Hi Chris, 

I am working with UKB imputed genomics data. I used plink2 to convert bgen files to vcf files and now I am trying to get dosage (00=0, 01=1,10=1,11=2). However, as you see, I have some other values such as ./.:0.5098 in my vcf files and when I use plink2 to convert vcf file to dosage, then I end up having some “NA” values in locations with ./. . If I am using imputed ukb data, then why do I have missing information in my dosage files? Also, how can I use the imputed ukb data in the correct way so as not to have "NA" in my dosage files? (Shall I add any flag in plink2?)

Thanks

Screen Shot 2020-10-22 at 2.34.44 PM.png

Christopher Chang

unread,
Nov 7, 2020, 11:49:45 PM11/7/20
to plink2-users

It sounds like you need to add "dosage=DS" to your --vcf flag invocation.

david anthony

unread,
Nov 8, 2020, 12:41:43 AM11/8/20
to plink2-users
Currently, I use the following three commands:
plink2 --bgen ukb_imp_chr1_v3.bgen ref-first --sample file.sample  --extract SNP_List.txt --make-pgen --out Ch1
plink2 --pfile Chr1 --export vcf bgz vcf-dosage=DS --out Chr1
plink2 --vcf Chr1.vcf.gz --recode A --out Chr1_dosage

And you recommend me to change third line to this:
plink2 --vcf dosage=DS Chr1.vcf.gz --recode A --out Chr1_dosage

Am I right? 

Thanks

david anthony

unread,
Nov 8, 2020, 1:03:36 AM11/8/20
to plink2-users
Apology for the typo. I meant should I replace this live with third line?
plink2 --vcf Chr1.vcf.gz dosage=DS --recode A --out Chr1

Jingqin Luo

unread,
Feb 25, 2021, 6:35:03 PM2/25/21
to plink2-users
Chris,

I am also analyzing the uk biobank imputed genotype data in bgen  files. 

The two lines are what we used to extract SNPs of interest and convert to dosage values.
 
plink2 --bgen ukb_imp_chr15_v3.bgen ref-first --sample sample.txt  --extract SNP.txt --make-bed --out Chr15

plink2 --bfile Chr15 --export A --out Chr15

(0) the header in the Chr15.raw file for a snp is in the format of rsID_T,  so, the allele suffix "T" should be the reference allele for the snp, right?

(1)What the dosage value of 0/1/2 code for?  The total number of major allele? the total number of reference allele? (seems surely not number of minor alleles) 

(2)I also saw the missing values in dosage values for some SNPs, which should not have missing values since the uk data is imputed.

(3) Is --recode A the above user used the same as --export A?

Thank you for your help in advance



Christopher Chang

unread,
Feb 25, 2021, 6:57:35 PM2/25/21
to plink2-users
0, 1. The .raw documentation notes that the "T" in rsID_T is the counted allele (i.e. when there's a dosage of "2", that means there are two copies of T).  The --export documentation notes that, by default, the REF allele is counted.
2. That's because you incorrectly used --make-bed/--bfile.  That file format cannot store dosages.  Use --make-pgen/--pfile instead.
3. Yes, they're equivalent.  --recode is deprecated because, in plink 1.x, --recode has an awful default file format.

Jingqin Luo

unread,
Feb 25, 2021, 9:10:30 PM2/25/21
to plink2-users
Thanks. Chris. 
I changed to --make-pgen/-pfile.  Now the dosage values of a SNP with column header rs***_T  in the .raw file are not  discrete values 0, 1, 2 but continuous between a continuous value between 0 and 2 ?  In that situation, i guess dosage=2*P(genotype=TT) + 1* P(genotype=AT or AT)?  But what if I acutally need the dosage of the alternative allele A??    
Note that in the 0/1/2 discrete coding case, the dosage of counting A vs. counting T can be converted by x vs. 2-x,  
But under the dosage of dosage=2*P(genotype=TT) + 1* P(genotype=AT or AT) , how should we convert to the dosage of allele A?

Jingqin Luo

unread,
Feb 26, 2021, 11:47:50 AM2/26/21
to plink2-users
Chris, FYI, I used 2-x to derive dosage for alternative allele and it seems to be good in the sense that I got the frequency as expected.

--
You received this message because you are subscribed to a topic in the Google Groups "plink2-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/plink2-users/hjgWQ5XNuGQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to plink2-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/72f22ee4-9380-40e8-bda5-532e0b5e2376n%40googlegroups.com.

Christopher Chang

unread,
Feb 26, 2021, 1:53:34 PM2/26/21
to plink2-users
Yes, 2-x always works for biallelic variants.  --export-allele can be used to control the counted alleles in the general case.
Reply all
Reply to author
Forward
0 new messages