Hi Angel, Julian, and wonderful Stacks developers!
I was about to post separately about this same behavior of calling genotypes with a single read and then saw this post, so I wanted to add a bit of context from my own analysis.
I understand that all of this makes sense from the point of view of the model. However, it does seem to result in genotype calls that can have substantial effects on downstream inference. Similar to Trevor, I set the gt-alpha setting for gstacks to 0.01 and, following recommendations from Paris et al. (2017), I did not originally set an additional coverage filter beyond m (the exact text from Paris et al. is :"Once the SNP model has made a determination, its evaluation should be trusted and using further non-statistically based limits on depth of coverage is ill advised and will result in the arbitrary dropping of loci.").
This resulted in some baffling results: even after instituting filters for individual and locus missingness as well as average read depth, over 80% of loci in some populations were out of HWE (p<0.01) and relatedness values of individuals that I knew to be related were wonky. I consulted with colleagues and tried to think about potential biological explanations (inbreeding? contamination?), but struggled to think of an explanation that made sense. In the meantime, similar to Trevor, I discovered that some loci were being called with extremely low read depths, including some that were called as heterozygotes with a read depth of 1(?). So, I recoded as missing any genotype that had a final read depth < m and, once I filtered for missingness, the majority of loci (>90%) were suddenly in HWE and relatedness converged towards expected values. The only change I made to the dataset was that one additional filter.
I wonder if it would be possible/wise to add a filter to gstacks (or populations) that does the same? I understand the statistical justification for calling a heterozygote from a single read if heterozygotes are common at that locus across the metapopulation, but I didn't realize that was happening and would have opted against it if I had the choice. Would it be possible to add an option to gstacks to allow people to decide if they want to impute individual genotypes based on data from the metapopulation?
For Stacks users reading this, consider adding a filter after Stacks runs to recode as missing any locus with extremely low (<m) read depths if you see unexpected funny results as I did.
Regardless, thanks to Julian, Angel, Nicolas, and everybody who has worked on, and continues to work on, Stacks! The program is awesome and I am greatly appreciative of all the work that goes into maintaining and updating it.
Sincerely,
John