Metagenome genome classifier comes up with a large amount of UNKNOWN. Any tips to help?

Jay Osvatic

May 9, 2016, 3:55:58 PM
to CLARK Users

Hi,

I just started using CLARK and have been using metagenome bins, created by ESOM, as input. Every bin from the assembly currently has over 50% UNKNOWN. Is there a way to reduce that percentage and increase the total amount labeled?

Would reducing the K-mer size help?

I am running CLARK with all the default settings.


Thank you,

Jay Osvatic
Swingley Lab
Northern Illinois University
 

Rachid

May 12, 2016, 12:28:57 AM
to CLARK Users
Hello Jay!

Yes, by reducing the k-mer length (e.g., 19, 20, or 21) with the full mode (option "-m 0") you should get more hits, but also more noise, which is why we recommend filtering the results and keeping only high-confidence assignments.
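
To illustrate why a shorter k tends to produce more hits, here is a minimal Python sketch. It is not CLARK code: the sequences are random toy data, and the helper names (`kmer_set`, `count_hits`, `substitute`) are made up for this example. The read carries one substitution every 25 bases, so every 31-mer in it spans at least one substitution, whereas 19-mers and 21-mers still fit inside the 24-base stretches between substitutions.

```python
import random

def kmer_set(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def count_hits(read, reference, k):
    """Count read k-mers that occur exactly somewhere in the reference."""
    index = kmer_set(reference, k)
    return sum(read[i:i + k] in index for i in range(len(read) - k + 1))

def substitute(seq, positions):
    """Replace the base at each given position with a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    bases = list(seq)
    for p in positions:
        bases[p] = swap[bases[p]]
    return "".join(bases)

# Toy data (hypothetical): a random 200 bp "genome" and a 100 bp read taken
# from it, with one substitution every 25 bases (4% divergence).
rng = random.Random(0)
reference = "".join(rng.choices("ACGT", k=200))
read = substitute(reference[50:150], [12, 37, 62, 87])

# Compare the default-like k=31 with the shorter lengths suggested above.
hits = {k: count_hits(read, reference, k) for k in (31, 21, 19)}
for k, n in hits.items():
    print(f"k={k}: {n} of {len(read) - k + 1} read k-mers found in the reference")
```

The same effect is what makes short k-mers noisier: a shorter k-mer is also more likely to occur by chance in an unrelated genome, hence the advice to keep only high-confidence assignments.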

Maybe you could also try CLARK-S (option "--spaced") if you have not already? To do so, you will need to build the databases of spaced k-mers (script "buildSpacedDB.sh", cf. README file).
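
As a rough illustration of why spaced k-mers can recover reads that contiguous k-mers miss, here is a small Python sketch. It is not CLARK-S code: the mask below is a hypothetical pattern chosen for readability (not one of the seeds CLARK-S actually uses), and the sequences are random toy data. Positions marked "0" in the mask are ignored, so a window can still match when a substitution falls on an ignored position.

```python
import random

# Hypothetical seed for illustration only (NOT an actual CLARK-S seed):
# '1' = position must match, '0' = position is ignored.
MASK = "110" * 10 + "1"  # span 31, weight 21

def spaced_key(window, mask):
    """Keep only the bases at the '1' positions of the mask."""
    return "".join(b for b, m in zip(window, mask) if m == "1")

def count_hits(read, reference, mask):
    """Count read windows whose spaced key occurs somewhere in the reference."""
    span = len(mask)
    index = {spaced_key(reference[i:i + span], mask)
             for i in range(len(reference) - span + 1)}
    return sum(spaced_key(read[i:i + span], mask) in index
               for i in range(len(read) - span + 1))

# Toy data (hypothetical): a random 200 bp "genome" and an 80 bp read taken
# from it with a single substitution in the middle.
rng = random.Random(0)
reference = "".join(rng.choices("ACGT", k=200))
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
read = reference[40:120]
read = read[:40] + swap[read[40]] + read[41:]

# An all-ones mask is equivalent to ordinary contiguous 31-mers.
contiguous_hits = count_hits(read, reference, "1" * 31)
spaced_hits = count_hits(read, reference, MASK)
print(f"contiguous 31-mers: {contiguous_hits} hits, spaced seed: {spaced_hits} hits")
```

With this mask, the windows in which the substitution lands on an ignored position still match, in addition to the windows that do not overlap the substitution at all.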

Finally, what is your database? If smaller k-mers and CLARK-S do not help, then you could expand your database with more genomes, e.g., the full set of bacterial/archaeal/viral genomes from RefSeq (using the "Custom" database, cf. README file), so that there are more k-mers to compare against.

Please let us know if this helps!

Best,
Rachid