output with all sites

Julie Jacquemin

unread,

Oct 31, 2014, 6:15:49 AM10/31/14

to stacks...@googlegroups.com

Hello,

I'm using Stacks on GBS data, and the population program to extract vcf files and get basic population genetic parameters.

However, for the purpose of my analysis (calcul of Pi after filtering for specific parameters for example), i would need to get a VCF file, or any other output that i could transform into vcf, of all the sites (variants + invariants) and not just the SNPs.

Is it possible with the current population program? All i see is the total number of these sites in the sumstat-summary. If not, will it be possibly implemented in newer versions?

Thanks a lot for your answers.

Best

Julie

Julian Catchen

unread,

Nov 10, 2014, 8:25:10 PM11/10/14

to stacks...@googlegroups.com, jacquemi...@gmail.com

Hi Julie,

It is not currently possible to get a VCF file with all fixed sites in
it. We do provide the 'genomic' output which includes all nucleotide
sites in the data set.

Why do you also want the invariant sites in the VCF? This could be
implemented in the populations program if it would be widely beneficial.
It could also be generated from the batch_X.haplotypes.tsv file with
some scripting.

Best,

julian

Julie Jacquemin

unread,

Nov 17, 2014, 9:57:13 AM11/17/14

to stacks...@googlegroups.com, jacquemi...@gmail.com, jcat...@uoregon.edu

Hi Julian,

Thanks a lof for your answer.

What i want to do is to calculate Nucleotide diversity Pi for my data. I know that Stacks populations provides a value for that but unfortunately this is not enough for me, because i want to calculate Pi on my data sorted for specific parameters.

For example only sites which are present in at least 2 individuals, or only sites that are present in at least 50% of the individuals...

For this last option i can not use the -r option of Stacks because it sorts on the percent of individuals present in the populations for the site to be processed, and not on the final number of genotype/SNP called by the program, which means that most of the time i end up with less minimum individuals that i wanted to if i use this option, because for some individuals a statistically significant SNP call could not be made in one nucleotide position.

To solve that i started sorting my vcf output directly with my own criteria and developed a script to calculate the Pi from the VCF. But for that to work i also need to know the total number of sites (fixed + variants) i had initially (necessary for the Pi calculation). I have to be able to filter the total number of sites also for the same criteria as my SNP vcf file (ex at least 50% individuals present), to get the correct value of Pi for each set of parameters applied on the data.

That's the reason why a VCF output of all sites, not just SNPs would be very useful to me. I will check more the 'genomic' output, to see if there is a way to transform it into a VCF file, in which case my problem would be solved.

I hope i was clear enough.

Best

Julie

Julie Jacquemin

unread,

Nov 18, 2014, 4:46:49 AM11/18/14

to stacks...@googlegroups.com, jacquemi...@gmail.com, jcat...@uoregon.edu

Dear Julian,

Unfortunately i can not find the description of the genomic output, that you suggested me to look at, anywhere in the Manual/Paper/google group or online. I'm not sure how to interpret the coding 0 1 5 8 10 in the file. Where can i find this information?

Julian Catchen

unread,

Dec 3, 2014, 4:50:17 PM12/3/14

to stacks...@googlegroups.com, jacquemi...@gmail.com

Hi Julie,

The genomic output formats SNPs/homozygotes according to this matrix
(found in the genotype_dictionaries.h file):

A C G T
{1, 2, 3, 4} A
{2, 5, 6, 7} C
{3, 6, 8, 9} G
{4, 7, 9, 10} T

So, an A/A homozygote is encode as 1, while an C/G heterozygote is
encoded as 6.

Reply all

Reply to author

Forward