vcf file bugs

386 views
Skip to first unread message

Jason

unread,
Apr 13, 2015, 5:09:28 PM4/13/15
to stacks...@googlegroups.com

Hi Julian and other users,

I think there may be some bugs in the output of vcf files in the latest version of STACKS (v1.29). here is a snippet of my VCF file for illustrative purposes:

1  un   28  1  C  T  . PASS NS=50;AF=0.970,0.030 GT:DP:AD:GL        ./.:0:.,.:.,.,.       ./.:0:.,.:.,.,. 0/0:16:8,8:.,22.18,. 0/0:11:5,6:.,15.25,.
2  un   77  2  A  T  . PASS NS=37;AF=0.973,0.027 GT:DP:AD:GL   0/0:13:6,7:.,18.02,.   0/0:9:4,5:.,12.48,. 0/0:7:3,4:.,9.7,. 0/0:12:6,6:.,16.64,.
4  un  201  4  C  T  . PASS NS=51;AF=0.902,0.098 GT:DP:AD:GL 0/1:26:11,15:.,36.04,.  0/0:14:7,7:.,19.41,. 0/0:11:5,6:.,15.25,. 0/0:25:12,13:.,34.66,.
5  un  415  7  C  T  . PASS NS=20;AF=0.925,0.075 GT:DP:AD:GL        ./.:0:.,.:.,.,.       ./.:0:.,.:.,.,. 0/0:5:2,3:.,6.93,. 0/0:5:2,3:.,6.93,.
6  un  479  8  G  T  . PASS NS=46;AF=0.935,0.065 GT:DP:AD:GL   0/0:11:5,6:.,15.25,.  0/0:11:5,6:.,15.25,. 0/0:4:2,2:.,5.55,. 0/0:17:8,9:.,23.57,.

ISSUE 1: Why do allele depths (AD) always appear in increasing order for the depths of the two alleles within a sample? I expected this to be random which makes me suspect something is amiss.
ISSUE 2: The reported allele depths do not seem to equate with the reported genotypes as homozygous (0/0, 1/1) or heterozygous (0/1). For example, you can see that all of the reported genotypes except 1 are 0/0 despite all of them having  coverage for both alleles.
I only get a heterozygote called once when allele coverage was very high. certain the genotype calling model does not report a depth pf 7,7 to be a homozygote?
ISSUE 3: Why do all my loci appear to be heterozygotes based on the allele depth (AD) profiles? That is impossible!

Perhaps I am misunderstanding the vcf output here, or perhaps there are some serious BUGS in the VCF output.

Best wishes,

Jason

mathi

unread,
Apr 16, 2015, 12:31:13 PM4/16/15
to stacks...@googlegroups.com
Hi there,
can't comment on those problems Jason, as I am still on an older STACKS version that does not output reads per allele.
I found another issue that is definetely a bug in the vcf output (perhaps other outputs as well):

Example:
I run populations -r 0.75 on a population_map with 9 samples. 9*0.75 should be 6.75, so minimum NS=7 individuals to include a given site. However, in the vcf there are many thousands of sites that have NS<7:

un    45847    478    T    C    .    PASS    NS=6;AF=0.833,0.167    GT:DP:GL    1/0:6:.,-6.44833,.    1/0:4:.,-4.39445,.    ./.:0:.,.,.    0/0:5:.,.,.    0/0:33:.,.,.    0/0:7:.,.,.    0/0:11:.,.,.    ./.:4:.,.,.    ./.:0:.,.,.
un    46555    485    G    A    .    PASS    NS=4;AF=0.750,0.250    GT:DP:GL    1/1:8:.,.,.    ./.:8:.,.,.    0/0:6:.,.,.    ./.:5:.,.,.    0/0:9:.,.,.    ./.:6:.,.,.    0/0:4:.,.,.    ./.:0:.,.,.    ./.:0:.,.,.
un    46584    486    C    T    .    PASS    NS=8;AF=0.875,0.125    GT:DP:GL    ./.:0:.,.,.    0/0:16:.,.,.    0/0:14:.,.,.    0/0:10:.,.,.    0/0:13:.,.,.    0/0:14:.,.,.    1/1:5:.,.,.    0/0:13:.,.,.    0/0:6:.,.,.

How is this possible? Is it related to the issue (which I'd call another bug) that SNPs / positions within the same RADtag do not necessarily have the same NS / individual missingness?
I can run vcftools to filter these sites out, but that should not be required.

I am trying to find out how many variant AND invariant RADtags pass a given -m and -r threshold (the sumstats.summary.tsv lists a total number of observed sites, but strangely the number there is NOT an integer mutliple of the length of the RADtags... So I cannot trust it.). STACKS does not allow me to do this, even when refetching the coverage by mining the matches.tsv files, since I have no clue how to interpret the missingness dropout. Have I or have I not made observations at  these sites? (in the sense of calculating an "absolute" nucleotide diversity, that does include also invariant sites that are in entirely invariant RADtags)

regards,
Mathias



mathi

unread,
Apr 16, 2015, 12:36:59 PM4/16/15
to stacks...@googlegroups.com
Ah something I just noticed now: in the cases where NS<7, there are actually enough reads in the missing samples, but their genotypes have been removed (see above). So it seems the outputting script is considering the reads rather than the actual genotype presence to make the decision... I guess those genotypes were excluded by some filtering thing, unfortunately it is impossible to find out why exactly...

Mark Ravinet

unread,
Apr 17, 2015, 2:43:00 AM4/17/15
to stacks...@googlegroups.com
Hi Mathias,

If I recall correctly from a previous discussion here, the -r flag in populations filters based on the presence/absence of the RAD locus, not the SNP genotypes.

One approximate way to identify the number of variant/invariant loci is to examine the haplotypes and vcf files. Run the following command on the vcf to count the number of RAD loci that SNPs are called on.

tail -n +10 batch1.vcf | cut -f 3 | uniq | wc -l

Then use this on the haplotypes file to get a count of all the RAD tags that passed your populations filters.

 tail -n +2 batch1.haplotypes.tsv | wc -l 

 The difference between those two counts will be the number of invariant RAD loci. However I think that a few other filters are applied to the haplotypes/vcf files so this number will not be exact.

Hope that helps

Mark

Julian Catchen

unread,
Apr 20, 2015, 4:02:29 PM4/20/15
to stacks...@googlegroups.com, msp...@gmail.com
Hi Mathias,

This has been discussed many times on this mailing list and Mark has
given the answer in a later message.

The behavior of the -r and -p filters were based on the presence or
absence of the RAD locus, which can contain one or more SNPs within it,
in each individual. The reason for this was purely technical: to
determine if each SNP call was valid we have to calculate summary
statistics across all loci in all individuals. The -r/-p filtering steps
were implemented prior to the calculation of summary statistics as it
significantly reduces the number of loci in the analysis, significantly
reducing the amount of memory and time required for the operation.

The side effect of this behavior is that different SNPs within a RAD
locus that has been retained after filtering may or may not be valid in
certain individuals. Contrary to what you state this is not a bug but a
function of sequencing error. Here is an example:

* *
consensus ATAAATATGAGTGAGAAGTAGTCTATTTTGAAGCTGAGTGACAGTTTGAGG
model OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOEOOOOOOOOUOOOOOOOOOO
read1 ataaatatgagtgagaagtagtctattttgaGgctgagtgTcagtttgagg
read2 ataaatatgagtgagaagtagtctattttgaGgctgagtgCcagtttgagg
read3 ataaatatgagtgagaagtagtctattttgaAgctgagtgAcagtttgagg
read4 ataaatatgagtgagaagtagtctattttgaAgctgagtgAcagtttgagg
read5 ataaatatgagtgagaagtagtctattttgaAgctgagtgAcagtttgagg

Here is a fictitious RAD locus with 2 SNPs (marked by *), the first one
was called a hEterozygote, the second one, which in this example is an
A/C SNP could not be called because of the T sequencing error and is
Unknown (all other positions have no error and are called hOmozygous).

If you have a RAD locus in one individual with two SNPs, and the depth
of that locus is just enough reads to call a heterozygote at the first
SNP position, due to sequencing error, there may not be as much evidence
for the SNP at the second position and the SNP calling model may not be
able to make a significant call, causing the position to be designated
'unknown' and then no data is counted from that individual for that SNP
position and hence the number of samples for that SNP position could
drop below the original -r/-p threshold.

However, the first SNP is valid and should be included in the analysis.

This also explains summations such as the total number of sites
examined, as you must always take into account the result of the model
when counting the number of homozygotes and heterozygotes and not count
those sites for which we don't know the state.

As of Stacks 1.26, we have implemented a second layer of filtering,
which is done after the summary statistics are calculated allowing us to
remove additional SNP positions that fall below the -r/-p or -a
filtering thresholds (we then have to re-calculate the summary
statistics again after the second filtering step).

julian

Julian Catchen

unread,
Apr 20, 2015, 4:16:46 PM4/20/15
to stacks...@googlegroups.com, msp...@gmail.com
Almost all output from the Stacks pipeline provides the ability to track
down exactly what is happening in the software. In almost all cases you
can count, by eye, all the genotype calls for a specific locus, or
individual to understand exactly what is being output in the VCF, or
Structure, or GenePop files, or whatever other output you want.

Most importantly, the web interface will immediately tell you what is
happening at a specific locus across the population. For each individual
it will give you the output of the SNP calling model at each position
and quickly allow you to see each individual stack for a catalog locus.
It will also tell you broad characteristics about the data (by providing
a set of filters on the web page), such as number of loci with a certain
number of SNPs, or alleles, or samples matching the catalog.

A user can take a locus from the sumstats output or the VCF file, or
another format, and quickly verify by hand that the numbers add up and
the output is exactly correct.

It is not, and should not be "impossible to find out why exactly" for
any aspect of the operation of the pipeline.
Reply all
Reply to author
Forward
0 new messages