Kraken2 returns majority (97%+) reads unclassified

1,012 views
Skip to first unread message

Rachel Mugge

unread,
Jul 24, 2022, 12:13:10 PM7/24/22
to Microbiome Helper
Hi all,

Thanks for the great tutorials. I am following the MG SOPv3 and am attempting to assign taxonomy using Kraken2, but I cannot get more than ~3% of reads to classify (for one sample). I've already posted to the Kraken2 github, but figured I would ask here as well, to get another opinion. Here's what I have tried:

-Multiple databases, including kraken2 standard, kraken2 standard 8gb, RefSeqCompleteV205 100gb, SILVA132, kraken NCBI

-Adjusting '--confidence' at multiple values between 0 and 0.5

-Input concatenated reads file as fastq

-Input concatenated reads file as fasta

-Input paired (R1 and R2) sequence files (fastq) using the '--paired' flag

-Running with and without '--use-names'

-Running with and without '--memory-mapping'

-Running with and without '--quick'

Any ideas on what might be happening? I checked if it was a pre-processing error by 1. Running FastQC on concatenated sequencing lanes and on concatenated reads (F and R) following the KneadData step (everything looks good to me), and by 2. Testing a sample from a different dataset that was pre-processed following the MG SOPv2 (same sample type, also did not classify >3%). These samples came from marine biofilms on steel and I don't think they are that novel; however, maybe I am missing something obvious.

Thank you,

Rachel

Andre Comeau

unread,
Jul 25, 2022, 12:26:54 PM7/25/22
to Microbiome Helper
Rachel,
Can you tell me (if we sequenced) or send me that one sample's gzipped files? I can try running it here on our system and see if we can reproduce it using the large RefSeqComplete we use.


ANDRÉ M. COMEAU, PhD
Manager  Integrated Microbiome Resource (IMR)
T: 902.494.2684 | E: andre....@dal.ca 

Address for deliveries:
Dept. of Pharmacology
Tupper Med. Bldg., room 5D
Dalhousie University
5850 College St.
Halifax NS B3H 4R2 

Research Associate (Lab Manager)

Morgan Langille Lab  Dept. of Pharmacology
ResearchGate Profile GoogleScholar Publications


"Without fantasy, there is no science. Without fact, there is no art." - Nabokov
"The good thing about science is that it's true whether or not you believe in it." - Neil deGrasse Tyson 


From: microbio...@googlegroups.com <microbio...@googlegroups.com> on behalf of Rachel Mugge <rachel...@gmail.com>
Sent: Sunday, July 24, 2022 1:13 PM
To: Microbiome Helper <microbio...@googlegroups.com>
Subject: [microbiome-helper] Kraken2 returns majority (97%+) reads unclassified
 
CAUTION: The Sender of this email is not from within Dalhousie.
--
You received this message because you are subscribed to the Google Groups "Microbiome Helper" group.
To unsubscribe from this group and stop receiving emails from it, send an email to microbiome-hel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/microbiome-helper/9a89ec69-ac9c-4c2d-80f9-495b18f0c993n%40googlegroups.com.

Rachel Mugge

unread,
Jul 25, 2022, 2:28:34 PM7/25/22
to Microbiome Helper
Hi Dr. Comeau,
That would be great, thank you! Yes, this sample (named B001) is from a IMR sequencing run a few months ago. The zipped file is too large to upload here but I will share a Google Drive link shortly.

Rachel

Rachel Mugge

unread,
Jul 25, 2022, 2:34:30 PM7/25/22
to Microbiome Helper
Also- here is my workflow thus far, in case that helps to diagnose.
Mugge_MBH_MG_SOPv3_workflow_072022.docx

Andre Comeau

unread,
Jul 25, 2022, 5:37:59 PM7/25/22
to Microbiome Helper
Rachel,
The sample is definitely odd/novel as I get similarly poor classification:

Kraken2 using Confidence Score = 0.5 (normal, more stringent)
6025970 sequences (710.91 Mbp) processed in 11.738s (30803.2 Kseq/m, 3633.99 Mbp/m).
  74896 sequences classified (1.24%)
  5951074 sequences unclassified (98.76%)

Kraken2 using Confidence Score = 0.1 (very permissive)
6025970 sequences (710.91 Mbp) processed in 8.940s (40444.6 Kseq/m, 4771.43 Mbp/m).
  813116 sequences classified (13.49%)
  5212854 sequences unclassified (86.51%)

...so it is not something with your system, but seems to be a legitimate difference with this sample.

Was this sample also ever done for 16S+18S here at the IMR? If so, were there a lot of novel/unclassified ASVs coming out of it?

In case it wasn't done, I'm just running SortMeRNA right now to extract the rRNAs from the metaG data and then we can have a look at them.



ANDRÉ M. COMEAU, PhD
Manager  Integrated Microbiome Resource (IMR)
T: 902.494.2684 | E: andre....@dal.ca 

Address for deliveries:
Dept. of Pharmacology
Tupper Med. Bldg., room 5D
Dalhousie University
5850 College St.
Halifax NS B3H 4R2 

Research Associate (Lab Manager)

Morgan Langille Lab  Dept. of Pharmacology
ResearchGate Profile GoogleScholar Publications


"Without fantasy, there is no science. Without fact, there is no art." - Nabokov
"The good thing about science is that it's true whether or not you believe in it." - Neil deGrasse Tyson 


Sent: Monday, July 25, 2022 3:34 PM
To: Microbiome Helper <microbio...@googlegroups.com>
Subject: Re: [microbiome-helper] Kraken2 returns majority (97%+) reads unclassified
 

Rachel Mugge

unread,
Jul 25, 2022, 6:09:23 PM7/25/22
to Microbiome Helper
Well, that is partially good news! I am glad it is not user error, at least.

Yes, we do have 16S data from IMR for this dataset. Let me look into that now and get back to you.

Rachel Mugge

unread,
Jul 25, 2022, 6:58:23 PM7/25/22
to Microbiome Helper
Alright, here's some relevant 16S data about that sample:

Sequence count as input to DADA2: 52439
Sequence ("feature") count after DADA2: 37766
Within whole dataset (n = 131, same sample type), number of ASVs: 9536
Within whole dataset, matching query sequences following assignment using VSEARCH and SILVA132 at 99% as reference: 9475 (99.36%)
Unassigned reads in sample B001: 0

So by comparing the 16S and MG data, the low number of classified reads from Kraken2 is concerning...since we are both getting the same result, maybe it's a bug in the newest version of Kraken2? I saw a recent post on the Kraken2 github from one of your lab members who compared percentages of classified reads from 3 versions of Kraken2 and got quite different results. Not sure if that is the same issue here, but I cannot think of what else the problem might be. 

Rachel Mugge

unread,
Jul 25, 2022, 7:24:54 PM7/25/22
to Microbiome Helper
I should have tried this earlier, but I am also going to run 4 other random samples from this dataset (n = 24) to make sure the first one isn't an anomaly.

Rachel Mugge

unread,
Jul 26, 2022, 10:37:17 AM7/26/22
to Microbiome Helper
Results from the 4 samples:

Command:

kraken2 --db /Volumes/FireFly_Promise_Pegasus/Databases/kraken2/k2_standard --threads 12 --quick --output /Volumes/FireFly_Promise_Pegasus/RMugge/DISSERTATION/Ch3/Metagenome_Microcosm2.0/kraken2_testing_random/output/output.kraken2.txt --report /Volumes/FireFly_Promise_Pegasus/RMugge/DISSERTATION/Ch3/Metagenome_Microcosm2.0/kraken2_testing_random/output/kraken2_report --use-names --memory-mapping --confidence 0.5 /Volumes/FireFly_Promise_Pegasus/RMugge/DISSERTATION/Ch3/Metagenome_Microcosm2.0/kraken2_testing_random/input/*.fastq

Output:

289348 sequences classified (1.02%)

  27965686 sequences unclassified (98.98%)


Andre Comeau

unread,
Jul 26, 2022, 12:18:40 PM7/26/22
to Microbiome Helper
OK good to know about the 16S. However, were there dominant ASVs in there that were only classified down to a shallow tax level (such as Proteobacteria-unclassified)? Those will still show up as matching query sequences, just get labeled with much less fine taxonomy.

For comparison, I extracted all the rRNA genes from the metaG sample and the largest amount of hits for the LSU were coming back as the following NCBI accession (see attached pivot table of the results as well):


Interestingly, this seems to be an unclassified zeta-proteobacterium scaffold from Bigelow that has since been removed from the dbase (it says due to awaiting publication, but that was submitted in 2013) and hence would not be in the Kraken2 dbases.

Maybe you are getting a highly specific consortium of these zetas being selected for in these biofilms. It might be worth trying some MAG assembly with these samples (either individually or "pooled" if behaving the same way) to see if you can get some larger scaffolds that would confirm you have a few dominant organisms/strains here and that they may be some pretty novel stuff. You'd be then able to map this LSU sequence back to the scaffolds to prove whether it belongs to this new zeta and you can then do a read mapping back to the constructed scaffolds to measure the amount of original reads you are recovering using them (therefore telling you whether the majority of reads missed by Kraken2 map to these new scaffolds and hence why they were missed).



ANDRÉ M. COMEAU, PhD
Manager  Integrated Microbiome Resource (IMR)
T: 902.494.2684 | E: andre....@dal.ca 

Address for deliveries:
Dept. of Pharmacology
Tupper Med. Bldg., room 5D
Dalhousie University
5850 College St.
Halifax NS B3H 4R2 

Research Associate (Lab Manager)

Morgan Langille Lab  Dept. of Pharmacology
ResearchGate Profile GoogleScholar Publications


"Without fantasy, there is no science. Without fact, there is no art." - Nabokov
"The good thing about science is that it's true whether or not you believe in it." - Neil deGrasse Tyson 

Sent: Monday, July 25, 2022 7:58 PM
MuggeB001.aligned.blast.xlsx

Andre Comeau

unread,
Jul 26, 2022, 12:19:11 PM7/26/22
to Microbiome Helper
OK - I got the impression from your first email it was just the first one sample that was behaving weirdly...see my last email about maybe assembling them all then.


ANDRÉ M. COMEAU, PhD
Manager  Integrated Microbiome Resource (IMR)
T: 902.494.2684 | E: andre....@dal.ca 

Address for deliveries:
Dept. of Pharmacology
Tupper Med. Bldg., room 5D
Dalhousie University
5850 College St.
Halifax NS B3H 4R2 

Research Associate (Lab Manager)

Morgan Langille Lab  Dept. of Pharmacology
ResearchGate Profile GoogleScholar Publications


"Without fantasy, there is no science. Without fact, there is no art." - Nabokov
"The good thing about science is that it's true whether or not you believe in it." - Neil deGrasse Tyson 


Sent: Tuesday, July 26, 2022 11:37 AM

Rachel Mugge

unread,
Jul 26, 2022, 4:03:49 PM7/26/22
to Microbiome Helper
Ah okay, I see what you mean- in the 16S data for this sample (after collapsing to species level and converting to relative frequency in QIIME2), individual taxa that account for >1% of the microbiome (n = 10) are classified to either the family or genus level, except for the top 2 taxa which are annotated as "uncultured bacterium" at the species level, and together account for ~40% of the microbiome: Sulfurimonas uncultured bacterium (22.85%) and Mariprofundus uncultured bacterium (17.43%). This is partially concurrent with your results where the top hit is a Zeta and the 4th and 5th hits are both Sulfurimonas species. The numbers don't quite line up with only 1-3% of reads being classified by Kraken2, but now it makes a bit more sense.

Thank you for helping me troubleshoot, and for the suggestions! I have not yet done an assembly-based metagenomics approach, but I will try that next as it looks like that will be necessary for this dataset.

Rachel

On Tuesday, July 26, 2022 at 11:18:40 AM UTC-5 André Comeau wrote:
OK good to know about the 16S. However, were there dominant ASVs in there that were only classified down to a shallow tax level (such as Proteobacteria-unclassified)? Those will still show up as matching query sequences, just get labeled with much less fine taxonomy.

For comparison, I extracted all the rRNA genes from the metaG sample and the largest amount of hits for the LSU were coming back as the following NCBI accession (see attached pivot table of the results as well):


Interestingly, this seems to be an unclassified zeta-proteobacterium scaffold from Bigelow that has since been removed from the dbase (it says due to awaiting publication, but that was submitted in 2013) and hence would not be in the Kraken2 dbases.

Maybe you are getting a highly specific consortium of these zetas being selected for in these biofilms. It might be worth trying some MAG assembly with these samples (either individually or "pooled" if behaving the same way) to see if you can get some larger scaffolds that would confirm you have a few dominant organisms/strains here and that they may be some pretty novel stuff. You'd be then able to map this LSU sequence back to the scaffolds to prove whether it belongs to this new zeta and you can then do a read mapping back to the constructed scaffolds to measure the amount of original reads you are recovering using them (therefore telling you whether the majority of reads missed by Kraken2 map to these new scaffolds and hence why they were missed).



ANDRÉ M. COMEAU, PhD
Manager  Integrated Microbiome Resource (IMR)
T: 902.494.2684 | E: andre....@dal.ca 

Address for deliveries:
Dept. of Pharmacology
Tupper Med. Bldg., room 5D
Dalhousie University
5850 College St.
Halifax NS B3H 4R2 

Research Associate (Lab Manager)

Morgan Langille Lab  Dept. of Pharmacology
ResearchGate Profile GoogleScholar Publications


"Without fantasy, there is no science. Without fact, there is no art." - Nabokov
"The good thing about science is that it's true whether or not you believe in it." - Neil deGrasse Tyson 

Andre Comeau

unread,
Jul 26, 2022, 5:16:58 PM7/26/22
to Microbiome Helper
OK yes, things are starting to line up. Remember, though, that you won't necessarily get 1:1 concordance between 16S and metaG, as it is very easy to identify a 16S gene and so we can usually taxonomically call most things in a sample something. Also note that 16S is PCR based (and only semi-quantitative, so proportions can get skewed after amplification) and so will potentially not amplify extremely divergent things, so the community can look pretty well covered even if there is something "quite-abundant-but-novel" there (although coverage these days is getting pretty good so that Universal primers are missing less and less things).

The metaG, being non-PCR based, should be less biased and therefore pick up more stuff (as long as the abundance is high enough) that the primers (being biased themselves) would not have picked up and should be in more representative proportions...the only problem then is that the whole genomes contain many more genes than just the one easily-identifiable 16S and so it is a lot harder to put a label on the much larger novel gene content in those genomes given the lack of corresponding reference sequences in the Kraken2/NCBI dbase.

So in summary, samples that may be a bit more difficult to analyze/describe, but potentially way more ecologically exciting!



ANDRÉ M. COMEAU, PhD
Manager  Integrated Microbiome Resource (IMR)
T: 902.494.2684 | E: andre....@dal.ca 

Address for deliveries:
Dept. of Pharmacology
Tupper Med. Bldg., room 5D
Dalhousie University
5850 College St.
Halifax NS B3H 4R2 

Research Associate (Lab Manager)

Morgan Langille Lab  Dept. of Pharmacology
ResearchGate Profile GoogleScholar Publications


"Without fantasy, there is no science. Without fact, there is no art." - Nabokov
"The good thing about science is that it's true whether or not you believe in it." - Neil deGrasse Tyson 


Sent: Tuesday, July 26, 2022 5:03 PM

Rachel Mugge

unread,
Jul 27, 2022, 6:09:28 PM7/27/22
to Microbiome Helper
Right, that all makes sense- thank you for taking the time to explain it. Yes, even though analysis for this dataset may be more difficult, getting the end result of ecologically exciting samples is what moves science forward, after all!

One more question: theoretically, say I assemble these samples and then map back to the aforementioned Zeta, and discover that the metagenome(s) is/are indeed dominated by this taxon. How should I proceed with describing the taxonomic composition? Is it still appropriate to use the classified 3% of reads from Kraken2 with the caveat that the dominant taxon is not in the database, or would those results no longer be relevant since I would be mixing results from read-based and assembly-based methods?

Andre Comeau

unread,
Jul 28, 2022, 1:09:07 PM7/28/22
to Microbiome Helper
Well, you could still use the Kraken2 results for the minor portion of the community covered and say that in the analysis/paper, but then something like: however, post MAG assembly and mapping back to the novel contigs/genomes then showed that XX% of the reads mapped to the novel zeta genomes, accounting for the majority of the missing taxonomic annotation. Or something to that effect, depending on the results.

An additional option you have, if you achieve a fairly good/clean MAG assembly of a novel zeta genome you feel you can trust (based on completeness, lack of contamination as measure by, say, CheckM), then you can integrate that new record into your K2 dbase and rerun it to see how the percent covered improves.



ANDRÉ M. COMEAU, PhD
Manager  Integrated Microbiome Resource (IMR)
T: 902.494.2684 | E: andre....@dal.ca 

Address for deliveries:
Dept. of Pharmacology
Tupper Med. Bldg., room 5D
Dalhousie University
5850 College St.
Halifax NS B3H 4R2 

Research Associate (Lab Manager)

Morgan Langille Lab  Dept. of Pharmacology
ResearchGate Profile GoogleScholar Publications


"Without fantasy, there is no science. Without fact, there is no art." - Nabokov
"The good thing about science is that it's true whether or not you believe in it." - Neil deGrasse Tyson 


Sent: Wednesday, July 27, 2022 7:09 PM

Rachel Mugge

unread,
Jul 28, 2022, 5:39:39 PM7/28/22
to Microbiome Helper
Okay, that is what I figured, but wanted to confirm with someone more knowledgeable than myself.

I will consider that additional option as well, assuming I can figure out how to do it...

Thank you!

Reply all
Reply to author
Forward
0 new messages