Hi,
I am using vphaser2 and feeding its output into vparser with the -noendvariant=10 -nt -codon options. The codon-level analysis is really useful for me, as I am sequencing HIV and there can be quite a lot of variation from the reference sequence.
After verifying some of the SNPs I find by Sanger sequencing, I see that the Fisher's exact test for strand bias omits some of the real SNPs in the codon analysis. Hence, I am now using the raw SNP output, although perhaps different statistics could help here (e.g., confidence intervals).
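In case it is useful, the confidence-interval idea can be sketched as follows. This is illustrative only: the counts are back-calculated from the 3.78% frequency at 14495x coverage in the example, and the choice of a Wilson score interval (rather than whatever vphaser uses internally) is my own assumption:

```python
import math

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion k/n."""
    if n == 0:
        return (0.0, 0.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))

# Illustrative: a codon seen in ~548 of 14495 reads (~3.78%).
lo, hi = wilson_interval(548, 14495)
```

One could then require the lower bound to clear some minimum frequency before accepting a variant, instead of (or in addition to) the strand-bias test.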
However, I also noticed that false positives sometimes occur when a codon has many reads but a high fraction of them are low quality (e.g., >70% LowQual).
For example, the GGG here is not observed by Sanger and has a very high LowQual fraction:
1467-1469 (coverage of 14495):
Accepted codons:
GGG (G) 3.78% (LowQual : 97.62%)
GGT (G) 96.22% (LowQual : 3.49%)
Rejected codons:
GAT (D) 2 count (2 HQ reads)
GG- (-) 1 count (1 HQ reads)
GTT (V) 2 count (0 HQ reads)
GGA (G) 2 count (1 HQ reads)
CGT (R) 2 count (0 HQ reads)
AGT (S) 2 count (0 HQ reads)
TGT (C) 3 count (1 HQ reads)
GGC (G) 5 count (3 HQ reads)
GCT (A) 4 count (0 HQ reads)
--- (-) 7 count (7 HQ reads)
My questions are:
First, can you let me know the basis for calling a read low quality (is this based on quality scores or on other metrics)?
Second, is there a way to exclude mutation calls that stem from codons with a high percentage of low-quality reads (e.g., in vphaser or vparser), or do I need to do this manually / write a new script?
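If it has to be a manual post-filter, something as simple as this is what I have in mind (a minimal sketch only; the codon entries are copied from the example above, and the 70% cutoff is just the illustrative threshold from my earlier paragraph, not a recommended value):

```python
# Reject accepted codons whose supporting reads are mostly low quality.
LOWQUAL_MAX = 70.0  # assumed cutoff: drop codons with >70% LowQual reads

# (codon, frequency %, LowQual %) as reported in the codon output above
accepted = [
    ("GGG", 3.78, 97.62),
    ("GGT", 96.22, 3.49),
]

kept = [codon for codon, freq, lowqual in accepted if lowqual <= LOWQUAL_MAX]
# GGG (97.62% LowQual) is filtered out; GGT is retained.
```

But if vphaser or vparser already exposes such a cutoff as an option, I would much rather use that.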
Thanks again,
Ron