Hi Eric,
Thanks for your reply. First, I need to apologize in that part of my initial post was partially inaccurate. UNMAPPED reads constituted, as I mentioned, 30 - 40% of the reads depending on the parameters used. Thus, 60 - 70% of the reads could be assigned to a gene. However, in the pathway (but not the genefamilies), UNINTEGRATED reads constituted another 50 - 70%, such that only 5 - 10% of the initial reads could be assigned to a pathway. As to your suggestions for troubleshooting:
-- I confirmed that there were no ambiguous nucleotides in the input fasta files.
-- I inspected the output for four of my samples. The *_bowtie2_unaligned.fa ranged from 30 - 60% of the original sequences. The *diamond_unaligned.fa ranged from 20 - 50% of the original sequences.
-- I BLASTed the first five sequences from the *diamond_unaligned.fa file. Three of those resulted in no matches on the nucleotide BLAST, with a fourth matching B vulgatus with a negative BLASTx output. The fifth had 92% identity to R. torques, with minimal BLASTx matches.
-- I ran DIAMOND on the *diamond_unaligned.fa file, using the code as indicated in the log file. Somewhat to my surprise, 2.9 million queries aligned, out of approximately 6.7 million input sequences.
My take home messages from this are that there probably are not a whole lot of false negatives among the sequences that failed to align during the HUMAnN2 run -- maybe a minority of them. I am curious what your take is on the finding that the vast majority of sequences that did align with Bowtie2 or DIAMOND failed to map to a pathway. Do you think this is a problem with Metacyc, or was that a fairly typical finding with the original HUMAnN that defaulted to KEGG?
Regards,
Matt