set missing genotypes to homozygous major allele genotypes

604 views
Skip to first unread message

freeseek

unread,
Mar 5, 2014, 7:48:13 PM3/5/14
to plink2...@googlegroups.com
This might sound like a weird request, but it would be nice to have an option to set missing genotypes to homozygous major allele genotypes. It is possible that, after merging two exome datasets, you end up with lots of rare variants with missing genotypes, as the sets of rare variants might be different in two different exome datasets. It seem unorthodox, but it would come handy if it was possible to just reset those missing genotypes to homozygous major allele genotypes.

Christopher Chang

unread,
Mar 6, 2014, 1:30:32 AM3/6/14
to plink2...@googlegroups.com
If you have a reference genome file, this is doable with --merge-mode 5.  The VCF reference merge example (https://www.cog-genomics.org/plink2/data#merge_vcf_example ) spells out the steps of a similar operation.

freeseek

unread,
Mar 6, 2014, 9:47:53 AM3/6/14
to plink2...@googlegroups.com
If I understand correctly, "--merge-mode 5" will overwrite calls with merge conflicts. My issue is different. I will explain with a typical example. Exome dataset A has a singleton variant. Exome dataset B, at the position of the singleton variant, has no alternate allele. When merging the two datasets, samples from dataset B will show up as unknown genotype, but really they should be homozygous reference. The real issue here stems from the fact that the VCF file for the two datasets does not encode information about which base pairs are homozygous reference and which base pairs are unknown genotype. I actually used CombineVariants from GATK to merge two VCF exome datasets, and the problem might be better addressed by the team that wrote CombineVariants for GATK. But I thought I would mention it here as well in case other people end up having the same problem.

I suppose some tricks to convert the missing calls in the plink binary bed file can be engineered outside of plink as well. If anybody has a good suggestion I would like to hear it.

Christopher Chang

unread,
Mar 6, 2014, 9:54:00 AM3/6/14
to plink2...@googlegroups.com
Assuming reference_genome and exome_dataset_b have matching FID/IIDs and include the same variants,

plink --bfile reference_genome --bmerge exome_dataset_b --merge-mode 5 --out [new prefix]

will use the exome dataset B call whenever it isn't missing, and use the reference genome otherwise.  Is this not quite what you want?

Christopher Chang

unread,
Mar 6, 2014, 10:20:05 AM3/6/14
to plink2...@googlegroups.com
Correction: this should be a 2- or 3-step process.

1. (if you have a reference genome file that contains more variants than you want)
Create a reference genome file which contains only the variants you want in the final file.

plink --bfile exome_dataset_b --write-snplist
plink --bfile reference_genome --extract plink.snplist --make-bed --out reference_genome_pruned

2. Remove all variants with missing calls from exome_dataset_b.

plink --bfile exome_dataset_b --geno 0.5 --make-bed --out exome_dataset_b_nomiss

3. Now merge mode 5 should do what you want?

plink --bfile reference_genome_pruned --bmerge exome_dataset_b_nomiss --merge-mode 5 --out [new prefix]

freeseek

unread,
Mar 6, 2014, 10:27:01 AM3/6/14
to plink2...@googlegroups.com
This is a great solution. In my case I just cannot merge the VCF files with plink, as I also need the combined VCF dataset with all the information that plink would drop when converting from VCF to plink format. Thank you nevertheless.

Christopher Chang

unread,
Mar 6, 2014, 8:20:56 PM3/6/14
to plink2...@googlegroups.com
Yeah, that's definitely outside PLINK's scope for now.

One question: do you need *all* the information in the VCF file, or just a few additional things that PLINK currently has to drop?  If the latter, I can at least make sure that the PLINK 2 file format retains everything else you need.

freeseek

unread,
Mar 6, 2014, 8:33:00 PM3/6/14
to plink2...@googlegroups.com
I think PLINK 2 is fine as it is now.

freeseek

unread,
Mar 10, 2014, 8:22:33 PM3/10/14
to plink2...@googlegroups.com
The following c code will compile to an executable which will take a binary plink bed file from standard input and will yield a binary plink bed file to standard output with all missing genotypes modified to homozygous A2 calls (which should correspond to homozygous major allele and homozygous reference allele for rare variants). By no means this code is fast though.



#include <unistd.h>

void main() {
  char in;
  read(0,&in,1);
  write(1, &in,1);
  read(0,&in,1);
  write(1, &in,1);
  read(0,&in,1);
  write(1, &in,1);
  while(read(0,&in,1)) {
    in |= (in & 0x55) << 1;
    write(1, &in,1);
  }
}
Reply all
Reply to author
Forward
0 new messages