How to properly output ref/alt alleles?

1,188 views
Skip to first unread message

Satshil Rana

unread,
Sep 11, 2017, 12:30:23 PM9/11/17
to plink2-users
Hi,

I have a dataset containing dbSNP IDs and positions which is in the plink binary format. I want to output as a VCF. Using the --recode VCF flag, along with --alleleACGT I get a properly coded VCF file, but the ref/alt sequences do not match the dbSNP ref/alts, but the positions and RSids match. I'm wondering how I can recode my file to vcf and keep the right ref/alts?

Thanks.

Satshil Rana

unread,
Sep 11, 2017, 12:41:34 PM9/11/17
to plink2-users
Just to clarify, I am using the latest plink 1.9 on linux and my whole command is:

plink_1.9/plink -bfile dbSNPs --recode vcf-iid --keep-allele-order --alleleACGT --out dbSNPs_recoded_keep_allele


Christopher Chang

unread,
Sep 11, 2017, 12:41:49 PM9/11/17
to plink2-users
plink 1.x does not normally keep track of reference alleles; it's necessary to import them from another source with --a2-allele.

Satshil Rana

unread,
Sep 11, 2017, 12:45:16 PM9/11/17
to plink2-users
Can you please clarify on this? I saw this option on the manual but I'm a bit confused. How would I go about importing the ref allele for a hg19 dataset? The only "source" file I have is the plink dataset (bed/bim/fam).

Thank you Christopher! 

Christopher Chang

unread,
Sep 11, 2017, 12:52:20 PM9/11/17
to plink2-users
This is impossible if you *only* have the plink dataset.  However, if you also have e.g. a 1000 Genomes or other reference-dataset VCF with the same variant IDs, "--a2-allele [reference VCF] 4 3 '#'" imports all the reference allele settings from the VCF.

I may add an option to import from a .fa file, though this wouldn't work for deletions and some other length-changing variations.

Satshil Rana

unread,
Sep 11, 2017, 5:11:25 PM9/11/17
to plink2-users
Perfect, thanks! I tried to use the dbSNP database and while there are matching IDs between the two, PLINK says that there are no changes. Upon checking plink1.9 the "ID" column is just the chrom:pos:ref:alt instead of the rsID. plink2's export vcf function properly exports the ID as rsIDs. How would I be able to use the VCF as input for the ref/alt alleles and have the output vcf have rsIDs as output? The dbSNP is formatted as standard vcf, col 1 is chrom, 2 is pos, 3 is ID, 4 is ref, 5 is alt.

Thank you.

Satshil Rana

unread,
Sep 13, 2017, 12:19:20 PM9/13/17
to plink2-users
After some more troubleshooting and experimenting I'm experiencing a new error:

plink --bfile dbSNPs --alleleACGT --recode vcf-iid --a2-allele All_20170710.vcf 4 3 '#' --out test_dbSNP

gives me the following error for every rsID 

Warning: Impossible A2 allele assignment for variant <ID>.

end file still doesn't contain proper ref sequence.

Any further tips?

Thanks!

Christopher Chang

unread,
Sep 13, 2017, 12:26:19 PM9/13/17
to plink2-users
Well, check what's happening with that variant ID.  It looks like the VCF is inconsistent with the dbSNPs file, and it's important to figure out how this could happen.

Satshil Rana

unread,
Sep 13, 2017, 4:23:14 PM9/13/17
to plink2-users
It's more so all the entries are giving that error. The vcf I obtained was straight from NCBI's database, so I'm going to assume its properly formatted. I ended up creating a short awk script which updated the ref columns for the PLINK formatted VCF with the base from NCBI's vcf.
 
awk 'BEGIN {OFS="\t"}; FNR==NR{a[$3]=$4;next}$3 in a{$4=a[$3]}1' All_20170710_noheaders.vcf plink_noheader.vcf > proper_ref_noheader.vcf


That should format the ref seq's to the proper ones. The "All_20170710" file is from NCBI's dbSNP for hg19 and the plink vcf is the output from the previous command. the alt's look to be fine.

Thanks!
Reply all
Reply to author
Forward
0 new messages