Concerns with alignment accuracy

Michael R

Aug 2, 2020, 6:37:05 PM

to SAMSA bioinformatics group

I have used SAMSA2 on quite a few samples and regularly find that Mycobacterium species (predominantly M. tuberculosis) are identified as highly prevalent. To confirm this, I tried aligning my FASTQ files to an M. tuberculosis genome using HISAT2. However, fewer than 0.01% of the reads align, whereas SAMSA2 outputs >10,000 hits to Mycobacterium. I can't think of an explanation for this discrepancy. Does anyone have any ideas?

Sam Westreich

Aug 11, 2020, 6:26:44 PM

to Michael R, SAMSA bioinformatics group

Hi Michael,

Interesting. When you get aligned reads to M. tuberculosis, have you checked which functions those are? The RefSeq database used by default doesn't contain entire genomes; it contains the protein sequences produced by each included organism. That may be causing the variation in results.

I'm assuming that the >10,000 hits you're getting in SAMSA2 are a significant fraction of the total reads that you're putting in.

I'm not certain why this is happening, but my suspicion is that a specific protein, perhaps only one or two M. tuberculosis proteins, is either highly conserved or similar to proteins from other species and accounts for the majority of the hits.
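One way to test that suspicion is to tally hits per reference protein. The following is only a minimal sketch: it assumes the DIAMOND hits are available in BLAST-style 12-column tabular format (subject ID in the second column), and the function name and sample rows are hypothetical, not part of SAMSA2.

```python
from collections import Counter


def top_subjects(lines, n=5):
    """Count hits per subject (reference protein) ID; return the n most common.

    Assumes tab-separated BLAST-style lines: qseqid, sseqid, pident, ...
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return counts.most_common(n)


# Hypothetical sample rows, for illustration only (not real results).
sample = [
    "read1\tprotA\t95.0\t50\t2\t0\t1\t150\t10\t59\t1e-20\t90.0",
    "read2\tprotA\t90.0\t50\t5\t0\t1\t150\t10\t59\t1e-15\t80.0",
    "read3\tprotB\t88.0\t50\t6\t0\t1\t150\t20\t69\t1e-10\t70.0",
]
print(top_subjects(sample))
```

With a real file, an open file handle can be passed in place of the list. If one or two IDs account for most of the Mycobacterium hits, that would support the conserved-protein explanation.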

You could also look at the e-value cutoffs; SAMSA2's DIAMOND step runs with an e-value cutoff of, I believe, 0.001. I'm not sure what cutoff threshold you're using for HISAT2.
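To see how sensitive the hit count is to the cutoff, the hits can be recounted at stricter thresholds. This is a rough sketch under the same assumption of BLAST-style tabular output (e-value in the 11th column); the 0.001 value is the default as recalled above, while the stricter cutoffs, the function name, and the sample rows are hypothetical.

```python
def hits_passing(lines, cutoffs=(1e-3, 1e-5, 1e-10)):
    """Return {cutoff: number of hits whose e-value is <= cutoff}.

    Assumes tab-separated BLAST-style lines with the e-value in column 11.
    """
    totals = {c: 0 for c in cutoffs}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip blank or malformed lines
        evalue = float(fields[10])
        for c in cutoffs:
            if evalue <= c:
                totals[c] += 1
    return totals


# Hypothetical sample rows, for illustration only (not real results).
rows = [
    "read1\tprotA\t95.0\t50\t2\t0\t1\t150\t10\t59\t1e-20\t90.0",
    "read2\tprotA\t80.0\t50\t10\t0\t1\t150\t10\t59\t1e-4\t45.0",
    "read3\tprotB\t70.0\t50\t15\t0\t1\t150\t20\t69\t5e-3\t30.0",
]
print(hits_passing(rows))
```

If the Mycobacterium count drops sharply at a stricter cutoff, many of those hits are weak matches.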

Best,

Sam

--

Sam Westreich

Microbiome Scientist, DNAnexus,

http://www.mosaicbiome.com

Michael R

Aug 11, 2020, 6:34:14 PM

to SAMSA bioinformatics group

Hi Sam,

That's correct. I think it may be skewing the results, as it makes Mycobacterium the most abundant organism in most samples.

Is there any way I could work around this? It is making it difficult to interpret the results.

Daniel Revillini

Apr 29, 2022, 12:16:12 PM

to SAMSA bioinformatics group

Hello all,

I'm interested to find out whether you ever resolved this. I am running into the same problem with Bacillus cereus across 46 samples. I'm concerned there is either actual contamination of some sort in early processing or some alignment weirdness.

Thanks!
