Hi all
I'm trying to edit a reference genome by applying VCF variants to the reference fasta file in order to create a consensus sequence of the genome using VCFtools.
I first aligned my reads to the reference genome with Bowtie 2.0 and then used Stacks to process individual SAM files. While visualizing a subset of the SAM files together with the VCF file output by populations against the reference genome in IGV, I encountered three problems that prevent the editing of the reference genome using the VCF file:
1- The reference allele in the REF column of the VCF file does not always correspond to the allele of the reference genome. Instead the reference allele seems to be something like the allele that received the highest number of counts (as determined by pstacks in the .alleles.tsv file) in the first individual processed by cstacks (although I am not sure about that...)
2- The SNPs found into negatively orientated stacks are not edited to fit the reference genome that is positively orientated. For instance, let's consider a G position in the reference genome at which there is a G/T polymorphism in the population. If the G/T polymorphism is discovered within a negatively orientated stack, it appears as a C/A polymorphism in the .tags file output by pstacks, but should appear as a G/T polymorphism in the VCF file to be visualized against the reference.
3- The position of the SNPs found into positively orientated strands appear to be shifted one bp upstream (1 bp in the 5' direction) as compared to the actual polymorphism present in the SAM files. This is obvious when the VCF file is visualized against SAM files along the reference genome, but only occurs for positively orientated stacks. In parallel, I found the starting position of the positively orientated stacks in the .tags.tsv file also shifted one bp upstream, so this may be the origin of the problem.
Maybe some of these problems are due to my own data, so every experience of other users is welcome!
Many thanks,
Pierre-Alexandre
Hi Julian,
Thank you for your useful answers.
I was using Stacks v1.01, so the problem of the SNPs positions for positively orientated stacks may come from this old version. I will check if the problem gets fixed when running v1.06.
Concerning the SNPs found into negatively orientated stacks, they should be edited to their complement to match the reference genome. This is because the VCF format does not contain any information about read orientation. This does not matter that much for applications that only use individual genotypes. However, it matters for all the analyses that compare the information in the VCF file with that from a reference genome, as for instance to identify non-synonymous mutations or use an outgroup species to unfold the site frequency spectrum of an ingroup species.
You mentioned that Stacks places the majority rule allele in the REF column. If this is done based on all the individuals present in the VCF file, the frequency of the allele in the REF column should be at least 0.5. However the allele in the REF column is sometimes the minor frequency allele. Does it mean that the majority rule does not apply to all the individuals?
Another minor point concerning the VCF format, missing genotypes should be encoded as “./.” and not as “.” (sorry if this has been corrected since version 1.01)
Many thanks,
Pierre-Alexandre