map + ped to VCF

118 views
Skip to first unread message

DBScan

unread,
Nov 16, 2022, 5:01:57 AM11/16/22
to plink2-users
I received data from a SNP array (hg19) and I would like to convert them to a VCF. (hg38). Unfortunately, PLINK somehow doesnt get the REF allele right. Currently, I am doing the following: 
Create the bed/bim/fam first:
`plink --ped {PED_FILE} --map {MAP_FILE} --out plink --snps-only just-acgt --real-ref-alleles --chr 1-26 --make-bed`
Create the VCF (plink2 version 3.7):
`plink2 --bed plink.bed --bim plink.bim --fam plink.fam --fa hs37d5.fa.zst --ref-from-fa --snps-only just-acgt --export vcf --out plink.vcf`

In the last step I do a LiftOver with GATK, but around 20% of the variants get removed because they contain the wrong REF allele.

Where is my error?

Christopher Chang

unread,
Nov 16, 2022, 11:57:42 AM11/16/22
to plink2-users
Are you sure you are providing the correct file as input to the GATK LiftOver?  Your "--out plink.vcf" command line won't actually be allowed by future plink2 builds, because it's basically a bug; see the discussion of --allow-misleading-out-arg.

DBScan

unread,
Nov 16, 2022, 12:24:31 PM11/16/22
to plink2-users
Yes.

What I basically did was fixing the reference alleles with bcftools norm, the --ref-from-fa does not seem to fix these alleles.

Christopher Chang

unread,
Nov 16, 2022, 12:37:29 PM11/16/22
to plink2-users
Please post an example plink.{bed,bim,bam} and plink.vcf.vcf fileset (could contain just 1 variant) where the wrong REF allele comes out of your second command.

DBScan

unread,
Nov 17, 2022, 9:45:38 AM11/17/22
to plink2-users
First line of the VCF with a variant which failed LiftOver:

chr1    752721  rs3131972       C       .       .       MismatchedRefAllele     AttemptedAlleles=C*->;AttemptedLocus=chr1:817341-817341;PR      GT      0/0

plink.fam:

1 Sample 0 0 2 -9

plink.bim (chose the same line as VCF):

1       rs3131972       0       752721  0       C

First line of plink.bed:

00000000: 01101100 00011011 00000001 00000011 00000011 00000011  l.....

Is this useful?

Zuxi Cui

unread,
Nov 17, 2022, 11:49:32 AM11/17/22
to DBScan, plink2-users
Hi all,

The first SNP listed here reports no "C" allele in the dbSNP references. (https://www.ncbi.nlm.nih.gov/snp/rs3131972)
However, your data says it is monomorphic to "C" at the location. You may double-check the quality of the data to figure it out.

Terry

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/36d0bdcb-43f3-4aa6-9624-4acfaf77b5fbn%40googlegroups.com.

Zuxi Cui

unread,
Nov 17, 2022, 11:51:25 AM11/17/22
to DBScan, plink2-users
BTW, I suspect it was a strand issue.

Christopher Chang

unread,
Nov 17, 2022, 12:34:05 PM11/17/22
to plink2-users
--ref-from-fa currently only swaps REF/ALT alleles when appropriate, it does not fill in the REF allele when it is not present at all in your .bim file.  You need to use --ref-allele (with something like the 1000 Genomes phase 3 VCF) to fill in missing REF alleles from legacy .map + .ped data.

I'll look into either extending the functionality of --ref-from-fa or making the documentation clearer on this point.

DBScan

unread,
Nov 23, 2022, 8:16:18 AM11/23/22
to plink2-users
Thanks for clarification, I will try with --ref-allele from dbSNP hg19 version and report back.

DBScan

unread,
Nov 24, 2022, 5:01:32 AM11/24/22
to plink2-users
The error still persists.

In my VCF generated by PLINK, the first 5 columns are these:
"1       838555  rs4970383       T       G"

In my dbSNP VCF, the corresponding line is:
"NC_000001.11    903175  rs4970383       C       A,G"

The REF allele should therefore be C and not a T if I am not mistaken.

I used the following PLINK command:
```plink2 --bfile plink --ref-allele REF_alleles.tsv 4 3 '#' --snps-only just-acgt --export vcf --out /scratch/GIS/plink```

I get many warnings like this:
"Warning: --ref-allele mismatch for biallelic variant 'rs4970383'."

What am I doing wrong here?

Christopher Chang

unread,
Nov 24, 2022, 10:30:58 AM11/24/22
to plink2-users
You cannot freely mix reference builds in the middle of your workflow.  A few SNPs, like rs4970383, are mapped to one strand in hg19 and the other in hg38.
Reply all
Reply to author
Forward
0 new messages