Hi there,
I'm not entirely sure which database you use, as QIIME2 has a few options for databases. The most frequently used one has been the one based on Silva V138, but Greengenes 2 will probably also be popular now that it is available. I assume that you are using oral microbiome samples (as you are interested in the HOMD), and in this study I actually compared several different classification methods on oral microbiome samples (as requested by a reviewer). We didn't really find any taxa that were not classified below the kingdom level, so I am surprised that you have many of these.
Your taxonomic classifications will not be affected by which database you use for PICRUSt2. I imagine that you will be better off using PICRUSt2 with the default database, for which the predictions have been validated across multiple datasets, than with a custom database, although if I were you I would likely explore the predictions from both. I imagine they will differ from one another, but whether the differences are meaningful is another question.
To get the trait tables, you could either re-annotate all of the genomes within HOMD or, as I would do, just generate the tables that you can from the annotations that already exist within HOMD. I would probably take the tsv version of the prokka annotations (i.e., the ones that are found here). You could take the NCBI annotations instead, but you'd probably need to compile them yourself from the gff files. For compiling the tables, I would just create a genome x function table (EC, COG, or whatever level of function you are interested in), where each cell holds the count of that function within the genome.
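As a rough illustration of that compilation step, here is a minimal Python sketch. It assumes each genome has a tab-separated annotation table with a header row containing an `EC_number` (or `COG`) column, as the prokka tsv output is expected to have, and that several IDs on one gene would be comma-separated. The genome names and rows below are made up, and the `assembly` label for the first column is also an assumption, so please check it against the format PICRUSt2 expects for custom trait tables.

```python
import csv
import io
from collections import Counter


def count_functions(tsv_handle, column="EC_number"):
    """Count how often each function ID occurs in one annotation table."""
    counts = Counter()
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        value = (row.get(column) or "").strip()
        if not value:
            continue  # gene has no annotation at this functional level
        # Assumption: multiple IDs on one gene are comma-separated.
        for func in value.split(","):
            func = func.strip()
            if func:
                counts[func] += 1
    return counts


def build_trait_table(per_genome_counts):
    """Return (header, rows) for a genome x function matrix; absent = 0."""
    functions = sorted({f for c in per_genome_counts.values() for f in c})
    header = ["assembly"] + functions  # first-column label is an assumption
    rows = [
        [genome] + [counts.get(f, 0) for f in functions]
        for genome, counts in sorted(per_genome_counts.items())
    ]
    return header, rows


def write_table(header, rows, out_handle):
    """Write the matrix as a tab-separated table."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Made-up annotation tables standing in for per-genome tsv files.
demo = {
    "GENOME_A": "locus_tag\tftype\tEC_number\tCOG\n"
                "A_1\tCDS\t1.1.1.1\tCOG0001\n"
                "A_2\tCDS\t1.1.1.1\t\n"
                "A_3\tCDS\t\tCOG0002\n",
    "GENOME_B": "locus_tag\tftype\tEC_number\tCOG\n"
                "B_1\tCDS\t2.7.7.7\tCOG0001\n",
}

per_genome = {g: count_functions(io.StringIO(t)) for g, t in demo.items()}
header, rows = build_trait_table(per_genome)

out = io.StringIO()
write_table(header, rows, out)
print(out.getvalue())
```

For real data you would open each genome's tsv file in place of the `io.StringIO` objects, and pass `column="COG"` to build the COG table instead. Functions missing from a genome are written as 0 rather than left blank.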
Hope that's helpful!
Robyn
Hi Robyn,
Thank you so much for the details. I will look into the JGI genome database.
To elaborate, in my current analysis I used PICRUSt2 for functional profiling, and taxonomic classification of the sequences was performed using DADA2 and QIIME2's naive Bayes classifier. I tried to check the taxonomy of the sequences contributing to some randomly chosen pathways. Most of the sequences contributing to these pathways were those classified only to the kingdom level (Kingdom: Bacteria), rather than sequences that were assigned to a species. So I checked five such sequences at random using BLAST, and most of them mapped to uncultured_bacterium_partial_strain_16S.
Would it be worth using the custom HOMD database itself for the PICRUSt2 analysis?
With respect to using a custom reference database, I have completed the steps up to and including sequence placement. To proceed with the hidden state prediction step, can you suggest some ways to generate the traits_count table? I saw Gavin's earlier comments on compiling those tables from various sources. Is there a simple way to construct those tables?
Thanks,
Brintha