HOMD database

Brintha V P cs18d017

Sep 7, 2024, 11:11:33 AM

to picrust-users

Hi,

I am using the HOMD database to classify the sequences in my dataset. When I tried PICRUSt2, there appeared to be a mismatch in the sequences that contribute to the pathway abundances. I suspect this is because the PICRUSt2 reference tree uses the Greengenes database. How can I use the HOMD sequences to construct the reference tree?

I am using the standalone PICRUSt2 tool.

Thanks in advance !!

Thanks,

Brintha

Robyn Wright

Sep 9, 2024, 9:20:21 AM

to picrust-users

Hi Brintha,

I don't think there is necessarily a mismatch in the sequences. Only the first version of PICRUSt used the Greengenes database; the PICRUSt2 reference database was built from JGI genomes, and these likely overlap with the HOMD database, since many genomes in both will be in NCBI RefSeq. I do not know how large the overlap is, but I'm sure it would be possible to work out which genomes are shared.

You can create a custom reference database for PICRUSt2 using the annotations that HOMD provides for the genomes and the tree constructed there. However, I will note that HOMD appears to use the NCBI annotations. These are typically created at the time the genomes are added to RefSeq and could therefore be slightly out of date, so how useful they are depends on the functions you are interested in. You can see information about using a non-default reference database here. At some point in the next few months, I hope to have a new database available that uses GTDB genomes (these are taken from NCBI and should therefore also overlap with HOMD).

Robyn

Brintha V P cs18d017

Sep 10, 2024, 12:16:36 PM

to picrust-users

Hi Robyn,

Thank you so much for the details. I will look into the JGI genome database.

To elaborate, in my current analysis I used PICRUSt2 for functional profiling, and taxonomic classification of the sequences was performed using DADA2 and QIIME2's naive Bayes classifier. I then checked the taxonomy of the sequences contributing to a few randomly chosen pathways. Most of the contributing sequences were those classified only to the kingdom level (Kingdom: Bacteria), rather than sequences assigned to a species. I checked five such sequences at random using BLAST, and most of them mapped to uncultured_bacterium_partial_strain_16S.

Would it be worth using the custom HOMD database itself for the PICRUSt2 analysis?

Regarding the custom reference database, I have completed the steps up to and including sequence placement. To proceed with the hidden state prediction step, can you suggest ways to generate the traits_count table? I saw Gavin's earlier comments on compiling those tables from various sources. Is there a simple way to construct those tables?

Thanks,

Brintha

Robyn Wright

Sep 12, 2024, 2:14:54 PM

to picrust-users

Hi there,

I'm not entirely sure which database you use, as QIIME2 has a few options for databases. The most frequently used one has been the one based on Silva V138, but Greengenes 2 will probably also be popular now that it is available. I assume that you are using oral microbiome samples (as you are interested in the HOMD), and in this study I actually compared several different classification methods on oral microbiome samples (as requested by a reviewer). We didn't really find any taxa that were not classified below the kingdom level, so I am surprised that you have many of these.

Your taxonomic classifications will not be affected by which database you use for PICRUSt2, and I imagine that you will be better off using PICRUSt2 with the default database, for which the predictions have been validated across multiple datasets, than with a custom database, although if I were you, I would likely explore the predictions from both. I imagine they will be different from one another, but whether the differences are meaningful is another question.

To get the trait tables, you could either re-annotate all of the genomes within HOMD, or - if I were you - I would just generate the ones that you can based on the annotations that exist within HOMD. I would probably take the tsv version of the prokka annotations (i.e., the ones that are found here). You could take the NCBI annotations, but you'd probably need to compile them yourself from the gff files. For compiling the tables, I would just create a genome x EC/COG/whatever level of function you are interested in, where each cell has a count of that function within the genome.
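The table-compiling step described above can be sketched roughly as follows. This is a minimal illustration, not a script from the PICRUSt2 or HOMD documentation. It assumes Prokka-style TSV files with an `EC_number` column, that the genome IDs used as row names match the tip labels in the custom reference tree, and that the first column of the trait table is named `assembly`, as in the default PICRUSt2 tables. The function names, the `EC:` prefix option, and the toy data are hypothetical.

```python
import csv
import io
from collections import Counter


def count_traits(tsv_handle, column="EC_number", prefix=""):
    """Count how often each non-empty value of `column` occurs in one TSV."""
    counts = Counter()
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        value = (row.get(column) or "").strip()
        if value:  # skip features without an annotation in this column
            counts[prefix + value] += 1
    return counts


def build_trait_table(per_genome_counts):
    """Return (header, rows) for a genome x function count matrix."""
    functions = sorted({f for c in per_genome_counts.values() for f in c})
    header = ["assembly"] + functions  # "assembly" is an assumed column name
    rows = [
        [genome] + [counts.get(f, 0) for f in functions]
        for genome, counts in sorted(per_genome_counts.items())
    ]
    return header, rows


def write_trait_table(header, rows, out_handle):
    """Write the matrix as tab-delimited text."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Toy data standing in for two per-genome annotation files (made up).
COLUMNS = "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
GENOME_A = COLUMNS + (
    "gA_1\tCDS\t900\tadh\t1.1.1.1\tCOG1064\tAlcohol dehydrogenase\n"
    "gA_2\tCDS\t600\t\t\t\thypothetical protein\n"
    "gA_3\tCDS\t900\tadhB\t1.1.1.1\tCOG1064\tAlcohol dehydrogenase\n"
)
GENOME_B = COLUMNS + (
    "gB_1\tCDS\t1200\tpgk\t2.7.2.3\tCOG0126\tPhosphoglycerate kinase\n"
)

per_genome = {
    "genome_A": count_traits(io.StringIO(GENOME_A), prefix="EC:"),
    "genome_B": count_traits(io.StringIO(GENOME_B), prefix="EC:"),
}
header, rows = build_trait_table(per_genome)

out = io.StringIO()
write_trait_table(header, rows, out)
table_text = out.getvalue()
```

For real data, the `io.StringIO` objects would be replaced by open file handles, one per genome, and `column="COG"` would give a COG table instead. It is worth comparing the resulting column IDs against the default trait tables shipped with the installed PICRUSt2 version before running hidden state prediction.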

Hope that's helpful!

Robyn

Brintha V P cs18d017

Sep 13, 2024, 11:18:03 AM

to picrust-users

Thank you so much for the detailed clarification, Robyn.

I used the eHOMD sequences to train the naive Bayes classifier in QIIME2, and some taxa in eHOMD have only one corresponding sequence. That might be the reason why certain sequences are assigned only up to the kingdom level, as explained in this article. If the number of sequences used for taxonomy classification is increased, then PICRUSt2 with the default database should suffice, as IMG covers most of the taxa in the HOMD database.

Your prompt replies were extremely helpful in figuring out the issue. Thanks a lot!