IMG/MER questions


sdpapet

Oct 25, 2016, 12:15:47 PM
to IMG User Forum

I am using the "Phylogenetic Distribution of Metagenomes" function on IMG/MER.

1> What cutoff should I normally choose: 30+%, 60+%, or 90+%?
2> The website says "phylogenetic distribution of genes for selected metagenomes". Does "distribution of genes" here mean all functional genes, or only phylogenetic marker genes such as 16S rRNA?
3> Which database is used for the comparison? COG? KO? What is the default database?
4> Also, when I select "Estimated gene copies", the output on the website shows gene numbers like 38(36). Which one is the raw gene count and which is the estimated gene count? However, when I download the Excel sheet (see my attachment), there are no numbers in parentheses. Whether I select "estimated gene copies" or "gene count", I get the same results.
5> How does the website calculate the relative abundance (the percentage for each taxon)? For each sample, when I manually divide the number of genes found in a phylum by the total number of genes, I get a different percentage from the one on the website.

The last question is about "Genome Clustering"

For each method, such as PCA, PCoA, and nMDS, can you tell me which distance matrix it uses? Is it Bray-Curtis or something else?

Thanks,
Ben

Natalia Ivanova

Oct 25, 2016, 5:33:30 PM
to IMG User Forum
1. The cutoffs are designed to be representative of different taxonomic levels: hits at 90+% identity are likely to genomes from the same species or genus; hits at 60-90% identity are likely from the same family or order; hits at 30-60% identity are probably from the same class or phylum. The cutoff selection depends on what you want to see.
2. The phylodistribution is for protein-coding genes only; percent identity brackets for ribosomal RNAs won't be compatible with protein-coding genes.
3. The reference database is IMG-NR, a collection of isolate genomes and trusted single cell genomes in IMG.
4. The number in parentheses is the number of genomes with hits, not the number of genes with hits. If you selected "estimated gene copies", the number shown is the number of genes multiplied by coverage, and that's what you get when you export the results. If you click on the count in IMG, you will be shown the number of genes without multiplying by coverage. If you see the same number of genes regardless of whether you select gene counts or estimated gene copies, it means that the metagenome lacks coverage information, which should have been provided at the time of submission.
5. The denominator is the total number of genes with at least some hits in isolate genomes with >=30% identity, not the total number of genes.
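
To make points 1, 4 and 5 concrete, here is a minimal Python sketch of the arithmetic described above. It is not IMG's actual code: the record layout, the sample values, and the function names are all hypothetical, and two details are assumptions rather than anything stated in this thread (a hit at exactly 90% or 60% is placed in the higher bracket, and per-gene coverage is used as the multiplier for estimated copies).

```python
# Hypothetical records: (gene_id, phylum of best hit or None,
# percent identity of best hit, coverage of the gene's scaffold).
GENES = [
    ("g1", "Proteobacteria", 95.0, 2.0),
    ("g2", "Proteobacteria", 72.0, 1.0),
    ("g3", "Firmicutes", 45.0, 3.0),
    ("g4", "Firmicutes", 25.0, 1.0),  # below 30% identity: not counted as a hit
    ("g5", None, 0.0, 4.0),           # no hit at all
]


def identity_bracket(pct):
    """Return the identity bracket for a hit, or None below 30%.

    Assumption: boundary values (exactly 90 or 60) go to the higher bracket.
    """
    if pct >= 90:
        return "90+%"
    if pct >= 60:
        return "60-90%"
    if pct >= 30:
        return "30-60%"
    return None


def phylum_counts(genes, estimated=False):
    """Count genes per phylum, keeping only hits at >=30% identity.

    With estimated=True each gene contributes its coverage instead of 1,
    i.e. "number of genes multiplied by coverage".
    """
    totals = {}
    for _gene_id, phylum, pct, coverage in genes:
        if phylum is None or identity_bracket(pct) is None:
            continue
        weight = coverage if estimated else 1.0
        totals[phylum] = totals.get(phylum, 0.0) + weight
    return totals


def percentages(counts):
    """Percent per phylum; the denominator is genes WITH hits, not all genes."""
    denom = sum(counts.values())
    if denom == 0:
        return {}
    return {phylum: 100.0 * n / denom for phylum, n in counts.items()}


if __name__ == "__main__":
    counts = phylum_counts(GENES)
    print("gene counts:", counts)
    print("estimated copies:", phylum_counts(GENES, estimated=True))
    print("percent (genes with hits as denominator):", percentages(counts))
    # Dividing by all genes instead gives a different, smaller figure:
    print("percent (all genes as denominator):",
          {p: 100.0 * n / len(GENES) for p, n in counts.items()})
```

With this toy data, two of the three genes with hits belong to Proteobacteria, so the phylum gets about 66.7% when the denominator is genes with hits, but 40% when dividing by all five genes. That difference is the kind of mismatch described in question 5. If every coverage value were 1 (or missing and defaulted to 1), the estimated copies would equal the raw gene counts, which matches the behaviour described in point 4.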