I have used SAMSA2 on quite a few samples and regularly find Mycobacterium species (predominantly M. tuberculosis) reported as highly prevalent. To confirm this, I tried aligning my FASTQ files to an M. tuberculosis genome using HISAT2. However, <0.01% of the reads align, whereas SAMSA2 reports >10,000 hits to Mycobacterium. I can't think of an explanation for this discrepancy. Does anyone have any ideas?
--
You received this message because you are subscribed to the Google Groups "SAMSA bioinformatics group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to samsa-bioinformatic...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/samsa-bioinformatics-group/70cae1e3-e587-4891-8fbc-2bf34a8f709co%40googlegroups.com.
Hi Michael,

Interesting. For the reads that SAMSA2 assigns to M. tuberculosis, have you checked which functions they map to? The RefSeq database used by default does not contain entire genomes; it contains the protein sequences produced by each included organism. That difference may be what is causing the variation in results.

I'm assuming that the >10,000 hits you're getting in SAMSA2 are a significant fraction of the total reads you're putting in.

I'm not certain why this is happening, but my suspicion is that a specific protein, perhaps only one or two M. tuberculosis proteins, accounts for the majority of the hits because it is either highly conserved or similar to proteins from other species.

You could also look at the e-value cutoffs; SAMSA2's DIAMOND step runs with an e-value cutoff of, I believe, 0.001. I'm not sure what cutoff threshold you're using for HISAT2.

Best,
Sam
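One way to test the "one or two proteins" idea is to tally how many hits each reference protein receives. The sketch below is only an illustration: it assumes the DIAMOND results have been exported in the standard 12-column BLAST tabular layout (subject ID in column 2, e-value in column 11), and the function name `top_subjects` and the file name in the comment are made up for this example, not part of SAMSA2.

```python
from collections import Counter


def top_subjects(lines, max_evalue=1e-3, n=10):
    """Count hits per subject (reference protein) ID.

    Assumes each line is a tab-separated, 12-column BLAST-tabular row:
    column 2 is the subject ID and column 11 is the e-value.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed rows
        subject = fields[1]
        evalue = float(fields[10])
        if evalue <= max_evalue:
            counts[subject] += 1
    # Most frequently hit reference proteins first
    return counts.most_common(n)


# Hypothetical usage (the file name is an assumption):
# with open("sample_diamond_hits.tsv") as handle:
#     for subject, hits in top_subjects(handle):
#         print(hits, subject, sep="\t")
```

If a handful of subject IDs account for most of the Mycobacterium hits, that would fit the conserved-protein explanation; a broad spread across many proteins would point elsewhere.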
On Sun, Aug 2, 2020 at 3:37 PM Michael R <m.ra...@outlook.com> wrote:
I have used SAMSA2 on quite a few samples and regularly find that mycobacterium species (predominantly M. tuberculosis) are commonly identified as being highly prevalent. To confirm this I have tried aligning my fastq files to an M. tuberculosis genome using HISAT2, however when doing this <0.01% of the reads will align, when SAMSA2 outputs >10000 hits to Mycobacterium. I can't think of an explanation for these discrepancies, does anyone have any ideas?