Filter Haplotype Wise Comparison

Dylan O'Hearn

Aug 26, 2023, 12:48:32 PM

to Stacks

Hello,

I'm doing a phylogenetic and divergence-dating analysis using whole RAD loci phased into haplotypes, and I'm unsure whether to use the --filter-haplotype-wise option. I'm considering it because many loci have lots of "N"s in some samples, where low coverage affected the genotype calls, and I want some strategy for filtering these out. I also considered just using Unix commands to delete every locus containing a run of, say, two or more Ns, but that seemed crude.
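For reference, the crude "drop any locus with a run of Ns" idea could be sketched in Python roughly as follows. This is only an illustration: the `loci` dictionary layout (locus ID mapped to per-sample sequences) and both function names are hypothetical, and parsing the actual Stacks output into that layout is not shown.

```python
import re


def has_n_run(seq, min_run=2):
    """Return True if seq contains at least min_run consecutive N characters."""
    return re.search(r"[Nn]{%d,}" % min_run, seq) is not None


def drop_loci_with_n_runs(loci, min_run=2):
    """Keep only loci in which no sample sequence contains a run of Ns.

    loci is an assumed layout: {locus_id: {sample_name: sequence}}.
    """
    return {
        locus_id: seqs
        for locus_id, seqs in loci.items()
        if not any(has_n_run(s, min_run) for s in seqs.values())
    }


# Hypothetical toy data, not real Stacks output.
toy_loci = {
    "locus_1": {"sampleA": "ACGTACGT", "sampleB": "ACGTNCGT"},  # single N only
    "locus_2": {"sampleA": "ACGTACGT", "sampleB": "ACNNNCGT"},  # run of three Ns
}
kept = drop_loci_with_n_runs(toy_loci)
```

With the toy data, only `locus_1` survives, because a single isolated N does not count as a run. Note that this discards the whole locus for every sample, which is why it is a blunt instrument.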

But when I looked more closely at the FHW output and compared it to the regular output, I saw that it wasn't just removing sites but actually changing what were apparently confident base calls (see screenshot, with regular on the left and FHW on the right). So it seems that it detects the Ns and then converts every base at that position into the consensus base (T in this case). Is that right? If so, I'm concerned about using it for phylogenetics, particularly with a relatively small set of ~100 loci, where these differences may have an impact.

I would appreciate some more mechanistic detail about what FHW actually does, to see if I'm understanding this correctly or if there's something I'm missing about how it works and when it's appropriate to use.

Thanks!

Dylan

fhw.png

Dylan O'Hearn

Aug 29, 2023, 3:19:25 PM

to Stacks

Having tinkered with this setting, I think I understand it better. The first level of the populations filter checks whether a locus is present in, for example, 80% of individuals. The second level checks whether each individual variant site at a locus is present in 80% of individuals. Then, if you use --filter-haplotype-wise, it checks whether the entire set of variant sites at the locus, i.e. the haplotype, rather than just a single variant site, is present in 80% of individuals. If fewer than 80% of individuals have bases called at every variant site (a complete haplotype), it moves through the locus (from beginning to end?), and when it finds a variant site with missing individuals, it changes every base at that site, both "N"s and confidently called bases, to the consensus base. It keeps doing this until 80% of individuals have the complete haplotype.
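To make the difference between the site-level and haplotype-level checks concrete, here is a toy Python model of the description above. It is not Stacks code and makes no claim about how populations is implemented; the data layout (sample name mapped to a string of calls at the variant sites only, with "N" for a missing call) and the function names are assumptions for illustration.

```python
def site_presence(haplotypes):
    """Fraction of samples with a called base at each variant site."""
    n_samples = len(haplotypes)
    n_sites = len(next(iter(haplotypes.values())))
    return [
        sum(h[i] != "N" for h in haplotypes.values()) / n_samples
        for i in range(n_sites)
    ]


def haplotype_presence(haplotypes):
    """Fraction of samples with a called base at every variant site."""
    return sum("N" not in h for h in haplotypes.values()) / len(haplotypes)


# Hypothetical locus with two variant sites and five samples.
toy = {"s1": "NT", "s2": "AN", "s3": "AT", "s4": "GT", "s5": "AC"}

per_site = site_presence(toy)        # each site is called in 4 of 5 samples
per_hap = haplotype_presence(toy)    # only 3 of 5 samples are complete
```

In this toy locus each site individually meets an 80% threshold, but only 60% of samples have a complete haplotype, so the locus would pass the site-wise check and fail the haplotype-wise one.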

This does in fact alter confidently genotyped sections of the locus in some individuals. I see how this is fine if you're using RADpainter, for example, since only the variant site calls are in the output, so a site that has been "homogenized" by FHW effectively doesn't exist. But if you're doing phylogenetics using the entire sequence and not just the variant sites, and maybe especially if you're using a program such as StarBEAST that estimates population size, this might introduce some bias. Of course, the missingness that FHW filters out could also introduce bias, so there's probably no perfect option, but it might be nice to have a filter option that gives more control over what Stacks does in this situation.

I don't know enough about the inner workings of Stacks to say what's feasible, but I would propose an alternative form of FHW that the user could choose: instead of homogenizing variant sites to meet the filter threshold, either the entire variant site is deleted from the locus, the individuals with incomplete haplotypes are deleted from the locus, or the entire locus is deleted.
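A rough Python sketch of those three alternatives, applied as post-processing rather than inside Stacks, might look like this. Everything here is hypothetical: the function name, the `strategy` values, and the data layout (sample name mapped to calls at the variant sites only, "N" for missing). For simplicity the site-dropping strategy removes every site with any missing call, which is stricter than a threshold-based rule would need to be.

```python
def filter_locus(haplotypes, threshold=0.8, strategy="drop_locus"):
    """Apply one of three hypothetical alternatives to homogenizing sites.

    Returns the (possibly reduced) locus, or None if the locus is discarded.
    """
    complete = {s: h for s, h in haplotypes.items() if "N" not in h}
    if len(complete) / len(haplotypes) >= threshold:
        return haplotypes  # enough complete haplotypes; leave locus untouched

    if strategy == "drop_locus":
        return None
    if strategy == "drop_samples":
        # Keep only individuals with a complete haplotype.
        return complete or None
    if strategy == "drop_sites":
        # Keep only sites where every individual has a called base.
        n_sites = len(next(iter(haplotypes.values())))
        keep = [
            i for i in range(n_sites)
            if all(h[i] != "N" for h in haplotypes.values())
        ]
        return {s: "".join(h[i] for i in keep) for s, h in haplotypes.items()}
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical locus: three variant sites, five samples, 60% complete.
toy = {"s1": "NTA", "s2": "ANA", "s3": "ATG", "s4": "GTA", "s5": "ACA"}

by_sites = filter_locus(toy, strategy="drop_sites")
by_samples = filter_locus(toy, strategy="drop_samples")
by_locus = filter_locus(toy, strategy="drop_locus")
```

None of the three strategies changes a called base: they only remove sites, individuals, or the locus, which is the property I'm after for whole-sequence phylogenetics.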
