Genotype calls with single read depth in VCF

Trevor Williams

May 21, 2024, 1:48:22 PM
to Stacks
Hello everyone,

I apologize if this or a similar question has been asked before, but I'm hoping to clarify something that I'm seeing in my VCF output after running populations. I'm using Stacks v2.66 to genotype individuals of a freshwater fish species that were sequenced using single-digest RAD-seq with paired-end sequencing. I'm running denovo_map.pl with the remove-PCR-duplicates filter and a --gt-alpha of 0.01, as my samples have a relatively low depth of coverage (mean depth ~10x).

After running populations, I'm seeing quite a few genotypes being called with only a single read (DP=1). Overall, around 8.5% of calls in my data set have a DP of 1. I'm wondering:

1. Am I interpreting the DP values correctly, i.e., that a DP value of 1 means the genotype was called from only a single read? If so, I'm worried that my genotypes are biased toward homozygosity.

2. What is the best way to fix this problem? As I mentioned, I already set --gt-alpha to 0.01, but is there any other way in the Stacks pipeline to remove calls/loci that have only 1 read?

Thank you in advance for any help/advice you can give! I've attached my denovo_map.log, populations.log, and gstacks.log files here for reference.

Thank you,

Trevor Williams
gstacks.log
populations.log
denovo_map.log

Angel Rivera-Colón

May 22, 2024, 5:21:59 PM
to Stacks
Hi Trevor,

Your interpretation is correct. At these sites, the genotype caller is confidently (with a p-value < 0.01, as specified by the --gt-alpha option) assigning a genotype using observations from only a single read. This sounds odd at first; however, in our tests we see that it happens when a major allele is present at very high frequency and the single observed read carries that major allele. Most often the resulting genotype is a homozygote for that allele: the model sees only one read, but the allele is so common that a homozygous call is the most likely explanation. From the point of view of the model, this is the expected behavior.

Regarding how to fix this: from the Stacks side there is nothing to fix, since the model is behaving as expected. Our usual advice is to set a more stringent --gt-alpha, as you have done; these genotype calls are still passing that stricter filter. There are possible alternatives outside of Stacks, such as filtering the VCF with a minimum depth threshold, but any such threshold should take the observed coverage in your samples into account.
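
For illustration, a minimal sketch of that kind of external filter could look like the following (this is not a Stacks tool; it assumes an uncompressed populations VCF with GT and DP in the FORMAT field, and the depth threshold of 4 is only an example):

#!/usr/bin/env python3
# Sketch: recode genotypes with low per-sample depth as missing in a VCF.
import sys

MIN_DP = 4  # example threshold; choose one based on your own coverage distribution

def recode_low_depth(vcf_in, vcf_out, min_dp=MIN_DP):
    with open(vcf_in) as fin, open(vcf_out, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            fmt_keys = fields[8].split(":")
            if "GT" not in fmt_keys or "DP" not in fmt_keys:
                fout.write(line)
                continue
            gt_idx = fmt_keys.index("GT")
            dp_idx = fmt_keys.index("DP")
            for i in range(9, len(fields)):
                sub = fields[i].split(":")
                dp = sub[dp_idx] if dp_idx < len(sub) else "."
                if not dp.isdigit() or int(dp) < min_dp:
                    sub[gt_idx] = "./."  # blank only the genotype; keep DP, AD, etc.
                    fields[i] = ":".join(sub)
            fout.write("\t".join(fields) + "\n")

if __name__ == "__main__":
    recode_low_depth(sys.argv[1], sys.argv[2])

Only the GT subfield is blanked; the rest of each per-sample record is left untouched, so downstream tools can still see the original depths.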

Biologically, and based on prior observations, this shouldn't have a very large impact on heterozygosity. In theory, these calls should be happening at sites with very high major allele frequency, so the overall heterozygosity should already be very low. If you see that this is not the case (e.g., calls with DP=1 being made and retained at more intermediate allele frequencies), please let us know.

Long story short, this is not totally unexpected given the specifications of the model, so there might not be anything to "fix". That said, it is an interesting (and complex) behavior, and it is something we have looked into.

Thanks,
Angel

John Swenson

Jun 21, 2024, 3:51:10 PM
to Stacks
Hi Angel, Julian, and wonderful Stacks developers!

I was about to post separately about this same behavior of calling genotypes with a single read and then saw this post, so I wanted to add a bit of context from my own analysis. 

I understand that all of this makes sense from the point of view of the model. However, it does seem to result in genotype calls that can have substantial effects on downstream inference. Similar to Trevor, I set --gt-alpha for gstacks to 0.01 and, following recommendations from Paris et al. (2017), I did not originally set an additional coverage filter beyond m (the exact text from Paris et al. is: "Once the SNP model has made a determination, its evaluation should be trusted and using further non-statistically based limits on depth of coverage is ill advised and will result in the arbitrary dropping of loci.").

This produced some baffling results: even after instituting filters for individual and locus missingness as well as average read depth, over 80% of loci in some populations were out of HWE (p < 0.01), and relatedness values for individuals that I knew to be related were wonky. I consulted with colleagues and tried to think about potential biological explanations (inbreeding? contamination?), but struggled to find one that made sense. In the meantime, similar to Trevor, I discovered that some loci were being called with extremely low read depths, including some called as heterozygotes with a read depth of 1(?). So, I recoded as missing any genotype that had a final read depth < m and, once I filtered for missingness, the majority of loci (>90%) were suddenly in HWE and relatedness converged towards expected values. The only change I made to the dataset was that one additional filter.
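
As a rough illustration of this kind of check (not the exact script I used; it assumes an uncompressed populations VCF with GT and DP in the FORMAT field), something like the following tallies calls and heterozygous calls by depth:

#!/usr/bin/env python3
# Sketch: tally genotype calls (and heterozygous calls) by per-sample depth in a VCF.
import sys
from collections import Counter

calls_by_dp = Counter()
hets_by_dp = Counter()

with open(sys.argv[1]) as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        fmt_keys = fields[8].split(":")
        for sample in fields[9:]:
            rec = dict(zip(fmt_keys, sample.split(":")))
            gt = rec.get("GT", ".")
            dp = rec.get("DP", ".")
            if gt in (".", "./.", ".|.") or not dp.isdigit():
                continue  # skip missing genotypes or missing depths
            dp = int(dp)
            calls_by_dp[dp] += 1
            if len(set(gt.replace("|", "/").split("/"))) > 1:
                hets_by_dp[dp] += 1

total = sum(calls_by_dp.values()) or 1
for dp in sorted(calls_by_dp):
    print(f"DP={dp}\tcalls={calls_by_dp[dp]}\thets={hets_by_dp[dp]}\t"
          f"frac_of_calls={calls_by_dp[dp] / total:.4f}")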

I wonder if it would be possible/wise to add a filter to gstacks (or populations) that does the same? I understand the statistical justification for calling a heterozygote from a single read if heterozygotes are common at that locus across the metapopulation, but I didn't realize that was happening and would have opted against it if I had the choice. Would it be possible to add an option to gstacks to allow people to decide if they want to impute individual genotypes based on data from the metapopulation? 

For Stacks users reading this: if you see unexpected, funny results as I did, consider adding a filter after Stacks runs to recode as missing any genotype with an extremely low (< m) read depth.

Regardless, thanks to Julian, Angel, Nicolas, and everybody who has worked on, and continues to work on, Stacks! The program is awesome and I am greatly appreciative of all the work that goes into maintaining and updating it.

Sincerely,
John

Catchen, Julian

Jun 26, 2024, 2:38:13 PM
to stacks...@googlegroups.com

Hi John,

Thanks for your detailed message and report on the behavior of your data (and comments on Stacks!). Before I respond in detail, can you tell me what you set m to in your analysis?

Cheers,

Julian

John Swenson

Jul 15, 2024, 2:08:45 PM
to Stacks
Hi Julian,

Sorry, I didn't get a notification that there was a reply on this thread and am just now seeing your response.

I had set m to 4 in my analysis.

Thanks,
John

Juan Enciso

Jul 25, 2024, 9:30:33 AM
to Stacks
Hi John,

May I ask what you used to set loci with depth < m to missing? Did you do this on a per-genotype or a per-locus basis? Thank you in advance for any insights you can provide.

Best,

Juan

Angel Rivera-Colón

Jul 26, 2024, 2:44:49 PM
to Stacks
Greetings John and Juan,

John, thanks again for the very detailed message and your report on the software's behavior. Your check was very thorough and definitely made us think a bit more about some details of the software.

The approach you used makes sense to me. However, I would advise against thinking of m itself as a depth filter for genotypes, since m is just a cutoff on the number of reads required for the clustering (or stacking) of reads early on during de novo locus assembly. It also applies only to the forward reads and may not be representative of the depth-of-coverage distribution across the whole locus. In other words, while looking at the distribution of depth and linking it to the effects of low-coverage sites is a good approach, the threshold for filtering genotypes (when needed) may not be m. Additionally, as also described in the Paris et al. 2017 paper, setting m too high (for example, with the goal of avoiding low-depth genotypes) may lead to other biases downstream resulting from poor clustering and assembly of loci.

This is just to say that checking for coverage-related effects is definitely a good approach; those effects may well be present and can certainly impact the data. However, we shouldn't think about this only within the scope of m. Independent of this, we have now added a flag to the new version of the software (2.67) designed for filtering genotypes based on coverage (populations --min-gt-depth). Please check that out and let us know if you get different results from your previous approach.
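
As one rough way to compare the two approaches (purely a sketch; the two input files are placeholders for your custom-filtered VCF and the new populations output), you could tally missing genotypes per sample in each and see how they differ:

#!/usr/bin/env python3
# Sketch: compare per-sample missing-genotype counts between two VCFs.
import sys

def missing_per_sample(vcf_path):
    samples, counts = [], {}
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#CHROM"):
                samples = line.rstrip("\n").split("\t")[9:]
                counts = {s: 0 for s in samples}
            elif not line.startswith("#"):
                fields = line.rstrip("\n").split("\t")
                for s, entry in zip(samples, fields[9:]):
                    if entry.split(":")[0] in (".", "./.", ".|."):
                        counts[s] += 1
    return counts

old = missing_per_sample(sys.argv[1])  # e.g. your custom-filtered VCF
new = missing_per_sample(sys.argv[2])  # e.g. populations output with --min-gt-depth
print("sample\tmissing_old\tmissing_new")
for sample in old:
    print(f"{sample}\t{old[sample]}\t{new.get(sample, 'NA')}")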

Juan, there is no way in Stacks to use m itself as a depth filter, for the reasons I mentioned above. I would guess that John used a different tool to do his filtering. The new flag in populations applies this depth filter at the level of individual genotypes.

Thanks,
Angel

John Swenson

Aug 9, 2024, 10:22:05 AM
to Stacks
Hi Angel,

Many thanks for your reply and for considering my inquiry as you have. I also appreciate the reminder that the paired-end reads don't come into play until later, so we cannot think of m and genotype depth as equivalent. I will update Stacks and try the new flag in populations; thank you for including it in the new update!

Juan, Angel is right: I used a custom script that I wrote in R to fine-tune my set of SNPs after running Stacks. It seems like, with the new filter in populations, that step may now be redundant (but I will run it to compare).

Appreciatively,
John
