Reference allele and SNP position in VCF file

1,511 views
Skip to first unread message

pagag...@gmail.com

unread,
Sep 16, 2013, 1:13:30 PM9/16/13
to stacks...@googlegroups.com

Hi all


I'm trying to edit a reference genome by applying VCF variants to the reference fasta file in order to create a consensus sequence of the genome using VCFtools.

I first aligned my reads to the reference genome with Bowtie 2.0 and then used Stacks to process individual SAM files. While visualizing a subset of the SAM files together with the VCF file output by populations against the reference genome in IGV, I encountered three problems that prevent the editing of the reference genome using the VCF file:


1- The reference allele in the REF column of the VCF file does not always correspond to the allele of the reference genome. Instead the reference allele seems to be something like the allele that received the highest number of counts (as determined by pstacks in the .alleles.tsv file) in the first individual processed by cstacks (although I am not sure about that...)


2- The SNPs found into negatively orientated stacks are not edited to fit the reference genome that is positively orientated. For instance, let's consider a G position in the reference genome at which there is a G/T polymorphism in the population. If the G/T polymorphism is discovered within a negatively orientated stack, it appears as a C/A polymorphism in the .tags file output by pstacks, but should appear as a G/T polymorphism in the VCF file to be visualized against the reference.


3- The position of the SNPs found into positively orientated strands appear to be shifted one bp upstream (1 bp in the 5' direction) as compared to the actual polymorphism present in the SAM files. This is obvious when the VCF file is visualized against SAM files along the reference genome, but only occurs for positively orientated stacks. In parallel, I found the starting position of the positively orientated stacks in the .tags.tsv file also shifted one bp upstream, so this may be the origin of the problem.


Maybe some of these problems are due to my own data, so every experience of other users is welcome!


Many thanks,

Pierre-Alexandre

Julian Catchen

unread,
Sep 18, 2013, 12:43:48 PM9/18/13
to stacks...@googlegroups.com, pagag...@gmail.com
Hi Pierre-Alexandre,

Some answers below.

On 9/16/13 10:13 AM, pagag...@gmail.com wrote:
> I'm trying to edit a reference genome by applying VCF variants to the reference
> fasta file in order to create a consensus sequence of the genome using VCFtools.
>
> I first aligned my reads to the reference genome with Bowtie 2.0 and then used
> Stacks to process individual SAM files. While visualizing a subset of the SAM
> files together with the VCF file output by populations against the reference
> genome in IGV, I encountered three problems that prevent the editing of the
> reference genome using the VCF file:
>
>
> 1- The reference allele in the REF column of the VCF file does not always
> correspond to the allele of the reference genome. Instead the reference allele
> seems to be something like the allele that received the highest number of counts
> (as determined by pstacks in the .alleles.tsv file) in the first individual
> processed by cstacks (although I am not sure about that...)
>

That's correct. Stacks calls SNPs with respect to it's own data, not with
respect to the reference genome, so there is no guaruntee that either of the
alleles called in a SNP will match the reference, so Stacks places the majority
rules allele in the REF column.

>
> 2- The SNPs found into negatively orientated stacks are not edited to fit the
> reference genome that is positively orientated. For instance, let's consider a G
> position in the reference genome at which there is a G/T polymorphism in the
> population. If the G/T polymorphism is discovered within a negatively orientated
> stack, it appears as a C/A polymorphism in the .tags file output by pstacks, but
> should appear as a G/T polymorphism in the VCF file to be visualized against the
> reference.
>

That's a good point, I hadn't considered this before. Does the VCF standard say
anything about this?

>
> 3- The position of the SNPs found into positively orientated strands appear to
> be shifted one bp upstream (1 bp in the 5' direction) as compared to the actual
> polymorphism present in the SAM files. This is obvious when the VCF file is
> visualized against SAM files along the reference genome, but only occurs for
> positively orientated stacks. In parallel, I found the starting position of the
> positively orientated stacks in the .tags.tsv file also shifted one bp upstream,
> so this may be the origin of the problem.
>

This is due to the fact that basepairs in SAM files start counting at position
1, but Stacks (and BAM files) starts counting at position 0 (which is convenient
for internal calculations). If you compare against the BAM files it should match.

Perhaps you could give me some examples of this behavior so I can double check
the code.

Also, just to make sure, are you using Stacks 1.06? We have corrected a few
off-by-one errors in the SNP positions recently.

>
> Maybe some of these problems are due to my own data, so every experience of
> other users is welcome!
>
>
> Many thanks,
>
> Pierre-Alexandre
>

Best,

julian

pierre-alexandre gagnaire

unread,
Sep 20, 2013, 11:07:15 AM9/20/13
to stacks...@googlegroups.com, jcat...@uoregon.edu

Hi Julian,

 

Thank you for your useful answers.

I was using Stacks v1.01, so the problem of the SNPs positions for positively orientated stacks may come from this old version. I will check if the problem gets fixed when running v1.06.

Concerning the SNPs found into negatively orientated stacks, they should be edited to their complement to match the reference genome. This is because the VCF format does not contain any information about read orientation. This does not matter that much for applications that only use individual genotypes. However, it matters for all the analyses that compare the information in the VCF file with that from a reference genome, as for instance to identify non-synonymous mutations or use an outgroup species to unfold the site frequency spectrum of an ingroup species.

You mentioned that Stacks places the majority rule allele in the REF column. If this is done based on all the individuals present in the VCF file, the frequency of the allele in the REF column should be at least 0.5. However the allele in the REF column is sometimes the minor frequency allele. Does it mean that the majority rule does not apply to all the individuals?

Another minor point concerning the VCF format, missing genotypes should be encoded as “./.” and not as “.” (sorry if this has been corrected since version 1.01)


Many thanks,
Pierre-Alexandre

Julian Catchen

unread,
Sep 21, 2013, 5:07:58 PM9/21/13
to stacks...@googlegroups.com, pierre-alexandre gagnaire
Hi Pierre-Alexandre,

Thanks for the feedback. I have corrected the points you raised (complementing
alleles from the negative strand, marking missing genotypes, ordering Reference
allele as most frequent).

Could you please try out the latest release before I make it public and let me
know if the changes work in your dataset or if you see any other problems.

http://creskolab.uoregon.edu/stacks/source/stacks-1.07.tar.gz

Best,

julian

pierre-alexandre gagnaire

unread,
Sep 23, 2013, 10:23:42 AM9/23/13
to stacks...@googlegroups.com, pierre-alexandre gagnaire, jcat...@uoregon.edu
Hi Julian,

Thanks for your corrections. I have installed version 1.07 which fixes the problem of the alleles from the negatively oriented stacks. The reference allele is now the most frequent one and the VCF is fully compatible with IGV.
However, one problem remains for the SNPs found into positively oriented stacks, which are still shifted one base pair 5' compared to their true position on the reference genome.
To illustrate the problem, I have attached two snapshots, one showing a SNP in a negatively oriented stack (no problem anymore) and one SNP found in a positively oriented stack (shifted 1 bp on the left). In these snapshots, you can see a VCF file made from 10 individuals on the top, and just below the SAM file of the first individual in the VCF (named ANN_S_01). The position of the SNP in the SAM file matches the reference genome, but not in the VCF. I seems that adding +1 to the position of every SNP found into positively oriented stacks would fix the problem.

Best,
Pierre-Alexandre
SNP_in_negative_stack_at_good_position.png
SNP_in_positive_stack_shifted_in_VCF.png

Julian Catchen

unread,
Sep 23, 2013, 8:36:34 PM9/23/13
to stacks...@googlegroups.com, pierre-alexandre gagnaire
Hi Pierre-Alexandre,

This problem should now be resolved. Unfortunately, you will have to run the
pipeline back from the pstacks stage to see the off-by-one error corrected.

Best,

julian

On 9/23/13 7:23 AM, pierre-alexandre gagnaire wrote:
> Hi Julian,
>

pierre-alexandre gagnaire

unread,
Sep 24, 2013, 3:52:52 AM9/24/13
to stacks...@googlegroups.com, pierre-alexandre gagnaire, jcat...@uoregon.edu
Hi Julian,

The problem is fully resolved now. Thanks a lot for your rapid modifications of the code!

Best
Pierre
Reply all
Reply to author
Forward
0 new messages