Correctly coding indels in plink

Mwangana Mubita

Jul 22, 2023, 11:40:51 AM
to plink2-users
Hi all,
I am relatively new to plink though I can carry out most analyses involving SNPs.
I have a challenge with how to correctly code indels in plink (I plan to convert my binary fileset to VCF and use it for liftover in downstream analysis).

For instance, my dataset has an indel with rsID rs72552763 and alleles ATGAT and AT. Its position is chr6:160139849-160139853 (GRCh38.p14).

In the .map file, which bp position should I indicate?

With respect to the .ped file, how should I code the alleles/genotypes?

I will appreciate guidance on this.

Best regards,

Mwangana


Christopher Chang

Jul 22, 2023, 12:26:28 PM
to plink2-users
The .ped + .map format isn't really sufficient to represent indels at all, because it does not preserve allele order.

Instead, use the VCF format, plink 2.0 --vcf to import it, and try to use plink 2.0 for all other data management operations, because plink 2.0 preserves allele order while plink 1.x does not.

Mwangana Mubita

Aug 5, 2023, 11:33:12 AM
to plink2-users
I have map/ped files that I am converting to pgen and then to VCF. My genotyping data is not in VCF.
My dataset has an indel, rs72552763, with REF allele ATGAT (or ATGA as shown on Ensembl) and ALT allele AT (or A as shown on Ensembl).

In the ped file I have coded the alleles of the indel as 3 for REF (i.e. ATGAT) and 4 for ALT (i.e. AT).
I have run the following commands:
1. plink2 --pedmap filename --fa Homo_sapiens_GRCh38.dna.chromosome.6.fa --ref-from-fa  --make-pgen --out filename
2. plink2 --pgen filename.pgen --pvar filename.pvar --psam filename.psam --export vcf --out filename
A VCF file is created, but there is the following warning: "Warning: At least one VCF allele code violates the official specification; other tools may not accept the file.  (Valid codes must either start with a '<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or represent a breakend.)"
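
The rule quoted in the warning can be sketched as a small check. This is only an illustration of the four cases the message lists, not plink2's actual implementation; the function name is hypothetical and the breakend test (looking for a square bracket) is a deliberate simplification. Numeric codes such as 3 and 4 fail the check, which is consistent with the warning appearing.

```python
import re

# Alleles made up only of the characters the warning lists.
_BASES = re.compile(r"[ACGTNacgtn]+")


def is_valid_vcf_allele(code: str) -> bool:
    """Return True if `code` matches one of the cases named in the warning.

    Hypothetical helper for illustration; the breakend test is simplified.
    """
    if code.startswith("<"):          # symbolic allele, e.g. <DEL>
        return True
    if code == "*":                   # isolated star
        return True
    if "[" in code or "]" in code:    # crude breakend detection (assumption)
        return True
    return _BASES.fullmatch(code) is not None


for allele in ["ATGAT", "AT", "3", "4"]:
    print(allele, is_valid_vcf_allele(allele))
```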

How can I correctly code these alleles so that I get a VCF with appropriate allele codes? Given that my dataset was not received as a VCF, how should I create a VCF for downstream analysis, i.e. haplotype phasing and genotype imputation?
I would appreciate a simple explanation with clear steps that I can follow. Thanks in advance.

Chris Chang

Aug 5, 2023, 11:51:31 AM
to Mwangana Mubita, plink2-users
Unfortunately, the only simple solution is to throw out all your indels with e.g. --snps-only.  .ped/.map CANNOT unambiguously represent indels in a way that works with --ref-from-fa.  You effectively have to backtrack and use what amounts to VCF representation at an earlier step.
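
As a rough sketch of what "VCF representation at an earlier step" could look like, the snippet below turns coded .ped genotypes into one VCF data line with explicit REF and ALT alleles. The function names are hypothetical. The 3/4 coding, the rsID, the alleles and the position are copied from the posts above and have not been checked against the GRCh38 reference; the chromosome label "6" is an assumption. POS, REF and ALT would need to be verified (including the anchor base) before real use.

```python
def ped_pair_to_gt(a1: str, a2: str, ref_code: str = "3", alt_code: str = "4") -> str:
    """Translate one .ped genotype (two allele codes) into an unphased VCF GT.

    "0" is the usual .ped missing-allele code and becomes ".".
    """
    index = {ref_code: "0", alt_code: "1", "0": "."}
    # Sort so heterozygotes are always written as 0/1 rather than 1/0.
    return "/".join(sorted([index[a1], index[a2]]))


def vcf_record(chrom, pos, rsid, ref, alt, ped_pairs):
    """Build one tab-separated VCF data line with a GT-only FORMAT column."""
    fixed = [chrom, str(pos), rsid, ref, alt, ".", ".", ".", "GT"]
    gts = [ped_pair_to_gt(a1, a2) for a1, a2 in ped_pairs]
    return "\t".join(fixed + gts)


# Four hypothetical samples: hom-ref, het, hom-alt, missing.
line = vcf_record(
    "6", 160139849, "rs72552763", "ATGAT", "AT",   # unverified values from the thread
    [("3", "3"), ("4", "3"), ("4", "4"), ("0", "0")],
)
print(line)
```

A full VCF also needs the `##fileformat` meta line and the `#CHROM` header row with sample IDs, which are omitted here.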

Mwangana Mubita

Aug 5, 2023, 12:05:13 PM
to Chris Chang, plink2-users
Thanks for the very quick response. That helps a lot.

Mwangana Mubita

Jun 10, 2024, 3:08:06 PM
to plink2-users
Just in case someone has this issue - I did not have to throw away the indels as I was able to build a VCF file using VCF-Simplify found on the link below: 

Best regards,