I have been trying several pipelines to functionally annotate metagenomes. I have a dataset of ~200 metagenomes with 0.5-5 million reads per sample. While I expect a portion of the reads to be hypothetical or unknown, I am finding that a very large proportion of reads (65-70%) are not classified. I am now trying SUPER-FOCUS with the different aligners, but I would like to know two things: is this typical of what you see in other samples (like the HMP or coral reef samples in your paper), and is using the -k 1 flag the only/easiest way to find the reads that are not aligning so I can pull them for downstream analysis?
Many thanks.
Yes, for now the easiest way is to set -k 1 and then write a simple script to retrieve the unmapped reads from the input.
I will add retrieving unmapped reads to my list of future features.
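In the meantime, here is a rough sketch of what such a script could look like. It rests on assumptions you should check against your own files: that the alignment file kept by -k 1 is tab-separated with the read ID in the first column (BLAST m8-style), and that the input reads are in FASTA format. The function names and file paths are only illustrative and are not part of SUPER-FOCUS.

```python
def mapped_ids(alignment_lines):
    """Collect read IDs from tabular alignment lines.

    Assumption: tab-separated, read (query) ID in the first column.
    """
    ids = set()
    for line in alignment_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        ids.add(line.split("\t")[0])
    return ids


def iter_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def unmapped_reads(fasta_handle, mapped):
    """Yield the FASTA records whose ID is not in the mapped set.

    Only the first whitespace-delimited token of the header is compared,
    since aligners usually report just that part as the read ID.
    """
    for header, seq in iter_fasta(fasta_handle):
        read_id = header.split()[0] if header.strip() else ""
        if read_id not in mapped:
            yield header, seq


def write_unmapped(fasta_path, alignments_path, out_path):
    """Write reads absent from the alignment file to a new FASTA file."""
    with open(alignments_path) as aln:
        mapped = mapped_ids(aln)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for header, seq in unmapped_reads(fasta, mapped):
            out.write(f">{header}\n{seq}\n")
```

You would call it as, for example, `write_unmapped("sample.fasta", "sample_alignments.m8", "sample_unmapped.fasta")`, where all three file names are placeholders. If your input is FASTQ, the reader would need to be adapted.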
Thanks