Filter Haplotype Wise Comparison


Dylan O'Hearn

Aug 26, 2023, 12:48:32 PM
to Stacks
Hello,

I'm doing a phylogenetic and divergence dating analysis using whole RAD loci phased into haplotypes, and I'm unsure whether to use the --filter-haplotype-wise (FHW) option. I'm considering it because many loci have lots of "N"s in some samples, where low coverage affected the genotype calls, and I wanted some strategy to filter these out. I also considered just using Unix commands to delete all loci that contain strings of, say, two or more Ns, but that seemed kind of crude.
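
For reference, the cruder approach could look something like the minimal Python sketch below. It assumes the loci have already been read into a dictionary keyed by locus ID, with one sequence per sample; that layout, the function name, and the example data are all hypothetical and not a Stacks format.

```python
import re

def drop_loci_with_n_runs(loci, min_run=2):
    """Keep only loci in which no sample sequence has a run of >= min_run Ns."""
    pattern = re.compile("N{%d,}" % min_run)
    return {
        locus_id: seqs
        for locus_id, seqs in loci.items()
        if not any(pattern.search(seq.upper()) for seq in seqs.values())
    }

# Hypothetical toy data: locus_2 has a run of two Ns, locus_3 a single N.
loci = {
    "locus_1": {"sampleA": "ACGTACGT", "sampleB": "ACGTACGA"},
    "locus_2": {"sampleA": "ACGTNNGT", "sampleB": "ACGTACGT"},
    "locus_3": {"sampleA": "ACNTACGT", "sampleB": "ACGTACGT"},
}

kept = drop_loci_with_n_runs(loci)
print(sorted(kept))  # ['locus_1', 'locus_3']
```

This throws away a whole locus because of one poorly covered sample, which is why it seemed crude.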

But when I looked closer at the FHW output and compared it to the regular output, I saw that it wasn't just removing sites but actually changing what were apparently confident base calls (see the screenshot, with regular on the left and FHW on the right). So it seems like it maybe detects the Ns, and then decides to convert every base at that position into whatever the consensus base is (T in this case). Is that right? If so, I'm concerned about using it for phylogenetics, particularly for a relatively small set of ~100 loci where these differences may have an impact.

I would appreciate some more mechanistic detail about what FHW actually does, so I can tell whether I'm understanding this correctly or whether I'm missing something about how it works and when it's appropriate to use.

Thanks!

Dylan
fhw.png

Dylan O'Hearn

Aug 29, 2023, 3:19:25 PM
to Stacks
Having tinkered around with this setting, I think I understand it better. The first level of filtering in populations checks whether a locus is present in 80% (for example) of individuals. The second level checks whether each individual variant site at a locus is present in 80% of individuals. Then, if you use --filter-haplotype-wise, it checks whether not just a single variant site but the entire set of variant sites at the locus, i.e. the haplotype, is present in 80% of individuals. If <80% of individuals have bases called at every variant site (a complete haplotype), it moves through the locus (from beginning to end?), and when it finds a variant site with missing individuals, it changes every base at that site, including both "N"s and bases that were confidently called, into whatever the consensus base is. It keeps doing this until it reaches the threshold where 80% of individuals have the complete haplotype.
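
To make that reading concrete, here is a minimal Python sketch of the behavior as I understand it from the output. This is not taken from the Stacks source: the function names, the left-to-right site order, and the use of the most common called base as the "consensus" are all assumptions on my part.

```python
from collections import Counter

def complete_fraction(seqs, sites):
    """Fraction of individuals with a called base at every remaining variant site."""
    complete = sum(all(s[i] != "N" for i in sites) for s in seqs.values())
    return complete / len(seqs)

def homogenize_haplotype_wise(seqs, variant_sites, min_fraction=0.8):
    """Hypothetical model of FHW: overwrite variant sites with the consensus
    base until enough individuals have a complete haplotype."""
    seqs = dict(seqs)
    sites = sorted(variant_sites)  # assumed order: beginning to end of locus
    homogenized = []
    while sites and complete_fraction(seqs, sites) < min_fraction:
        # Next variant site at which at least one individual is missing.
        site = next((i for i in sites if any(s[i] == "N" for s in seqs.values())), None)
        if site is None:
            break
        called = Counter(s[site] for s in seqs.values() if s[site] != "N")
        consensus = called.most_common(1)[0][0] if called else "N"
        # Every individual gets the consensus base, including confident calls.
        seqs = {name: s[:site] + consensus + s[site + 1:] for name, s in seqs.items()}
        sites.remove(site)
        homogenized.append(site)
    return seqs, homogenized

# Hypothetical locus with variant sites at positions 2 and 5 (0-based).
locus = {
    "A": "ACGTACGT",
    "B": "ACTTACGT",  # confident T at site 2
    "C": "ACNTACGT",  # missing at site 2
    "D": "ACGTANGT",  # missing at site 5
    "E": "ACGTAAGT",
}
result, changed = homogenize_haplotype_wise(locus, [2, 5])
print(changed)      # [2]
print(result["B"])  # ACGTACGT (the called T at site 2 became G)
```

In this toy case only 3 of 5 individuals start with a complete haplotype, so site 2 is homogenized and individual B loses a confidently called base, which is the pattern in the screenshot.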

This is in fact altering confidently genotyped sections of the locus in some individuals. I see how this is fine if you're using RADpainter, for example, since only the variant site calls are in the output, so if a given site is "homogenized" by FHW it effectively doesn't exist. But if you're doing phylogenetics using the entire sequence and not just the variant sites, and maybe especially if you're using a program such as StarBEAST that estimates population size, this might introduce some bias. Of course the missingness that FHW filters out could also introduce bias, so there's probably no perfect option, but it might be nice to have a filter option that gives more control over what Stacks does in this situation.

I don't know enough about the inner workings of Stacks to say what's feasible, but I would propose an alternative form of FHW that the user could choose. Instead of homogenizing variant sites to meet the filter threshold, it could delete the entire variant site from the locus, delete the individuals with incomplete haplotypes from the locus, or delete the entire locus itself.
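
As a rough illustration of those three alternatives, here is a minimal Python sketch. None of these functions exist in Stacks; the names and the toy data are hypothetical, and for simplicity the first one deletes every variant site that has missing calls rather than stopping once the threshold is met.

```python
def incomplete_individuals(seqs, variant_sites):
    """Names of individuals with an N at one or more variant sites."""
    return [name for name, s in seqs.items() if any(s[i] == "N" for i in variant_sites)]

def delete_sites_with_missing(seqs, variant_sites):
    """Option 1: remove the variant-site columns that have missing calls."""
    bad = {i for i in variant_sites if any(s[i] == "N" for s in seqs.values())}
    return {
        name: "".join(base for i, base in enumerate(s) if i not in bad)
        for name, s in seqs.items()
    }

def delete_incomplete_individuals(seqs, variant_sites):
    """Option 2: remove individuals whose haplotype is incomplete."""
    drop = set(incomplete_individuals(seqs, variant_sites))
    return {name: s for name, s in seqs.items() if name not in drop}

def delete_locus_if_below_threshold(seqs, variant_sites, min_fraction=0.8):
    """Option 3: return None (drop the locus) if too few haplotypes are complete."""
    n_complete = len(seqs) - len(incomplete_individuals(seqs, variant_sites))
    return seqs if n_complete / len(seqs) >= min_fraction else None

# Hypothetical locus with variant sites at positions 2 and 5 (0-based).
locus = {
    "A": "ACGTACGT",
    "B": "ACTTACGT",
    "C": "ACNTACGT",
    "D": "ACGTANGT",
    "E": "ACGTAAGT",
}
sites = [2, 5]

print(delete_sites_with_missing(locus, sites)["B"])            # ACTAGT
print(sorted(delete_incomplete_individuals(locus, sites)))     # ['A', 'B', 'E']
print(delete_locus_if_below_threshold(locus, sites))           # None
```

None of these options rewrites a called base; each one trades lost data (sites, individuals, or loci) for leaving the remaining sequence untouched.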