How do I change the KO database?


TY H

Mar 17, 2025, 2:10:22 PM
to picrust-users
Hello, I know that the software already provides a good KO database, but I still need to compare my data against my own protein library. I read the wiki tutorial, but it is very brief, even a bit arrogant, and I could not find the .model file that I needed in RAxML. In fact, I did not understand how each of these files is generated, which made me angry because I had to spend a lot of time learning other software.

Could you please explain in detail how the .model file is generated, and how I can replace the default KO database with a database that I generate from my own .fasta file?

Thanks!!!

Robyn Wright

Mar 17, 2025, 3:56:20 PM
to picrust-users
Hi there,

I'll just first note that I am the same person answering questions here as on the PICRUSt2 GitHub, so there is no need to message both. I try to answer questions as soon as I can, but I am a human so am not always able to get to things immediately :). The wiki page does not aim to give a tutorial for creating these files; it links to some of the steps that you will need to take in order to get these files if you wish to create your own database. If you would like us to make a tutorial then we can look into adding that at some point. Learning new software is a part of bioinformatics and is what we have had to do to be able to create the databases. I know it can be frustrating, and that is why we have provided PICRUSt2 databases for people to use, so that not everyone has to learn extra software. If you want to use a program in a way that is outside of the default, you will need to learn new things, and telling us that you are angry at us will not make us want to help you. However, I have provided some more details on the steps necessary to create a database below. You can see all of the steps that I took to create the new database for PICRUSt2 here, which of course includes creating these files.

To begin, you will need:
- 16S sequences for your database (or other amplicon sequencing target sequences)
- A table containing annotations/traits for all of the sequences that you want to include in your database
- A table containing counts of the copy number for each of your sequences
- A phylogenetic tree of your sequences

Then, you will need to:
1. Align the sequences - if using 16S sequences then I recommend ssu-align for this step
2. Run raxml-ng --check with the aligned sequences
3. Run raxml-ng --evaluate with your checked files and the phylogenetic tree
4. Reformat the checked files to a stockholm alignment format
5. Build the HMM with your stockholm-aligned files
6. The raxml_info file may be created by the raxml-ng --evaluate command, but this seems to depend a little on which version is used. You can see the details of how I got this from a newer raxml version in the page that I linked above. If I remember correctly, it is only needed for SEPP (and pplacer); if you are running epa-ng for placement then it wouldn't be needed.
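
As a rough illustration of step 4, here is a minimal Python sketch that rewrites an aligned FASTA as a single-block Stockholm alignment. It is not the script used for the PICRUSt2 database; the function names read_fasta and to_stockholm are hypothetical, and it assumes every sequence in the input is already aligned to the same length.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_id: sequence}, keeping gap characters."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without an identifier")
            # The first whitespace-delimited token of the header is the ID.
            current = tokens[0]
            if current in records:
                raise ValueError(f"duplicate sequence ID: {current}")
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def to_stockholm(alignment: Dict[str, str]) -> str:
    """Format an alignment as one Stockholm 1.0 block."""
    if not alignment:
        raise ValueError("alignment is empty")
    if len({len(seq) for seq in alignment.values()}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name in alignment)
    rows = [f"{name.ljust(width)}  {seq}" for name, seq in alignment.items()]
    return "\n".join(["# STOCKHOLM 1.0", ""] + rows + ["//"]) + "\n"


if __name__ == "__main__":
    demo = read_fasta([">seqA example", "AC-GT", ">seqB", "ACGGT"])
    print(to_stockholm(demo), end="")
```

Existing tools (for example esl-reformat from the Easel library that ships with HMMER) can do the same conversion; the sketch is only meant to show what the format change involves.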

These are just the basic steps - you can see the commands that I ran for creating the new database in the page that I linked, and hopefully you can decide for yourself which of these you will need to run.
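
To make steps 2, 3 and 5 a little more concrete, the sketch below only assembles the command lines and prints them; it does not run anything. The file names, the prefix, and the GTR+G model are placeholder assumptions, not the values used for the PICRUSt2 database, and the options should be checked against raxml-ng --help and hmmbuild -h for the versions you have installed.

```python
import shlex
from typing import List


def build_commands(msa: str, tree: str, stockholm: str, prefix: str,
                   model: str = "GTR+G") -> List[List[str]]:
    """Return argument lists for steps 2, 3 and 5 without running them."""
    # Step 2: sanity-check the alignment under the chosen model.
    check = ["raxml-ng", "--check", "--msa", msa,
             "--model", model, "--prefix", f"{prefix}_check"]
    # Step 3: optimise model parameters and branch lengths on a fixed tree.
    evaluate = ["raxml-ng", "--evaluate", "--msa", msa, "--tree", tree,
                "--model", model, "--prefix", f"{prefix}_eval"]
    # Step 5: hmmbuild takes the output HMM first, then the alignment.
    hmm = ["hmmbuild", "--dna", f"{prefix}.hmm", stockholm]
    return [check, evaluate, hmm]


if __name__ == "__main__":
    # All four arguments are hypothetical file names and prefixes.
    for cmd in build_commands("aligned.fasta", "ref.tre",
                              "aligned.sto", "custom_ref"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running each one by hand or through subprocess.run.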

Robyn

TY H

Mar 26, 2025, 8:48:52 AM
to picrust-users
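Thank you for your reply. Please do pay attention to my question, though. First, I appreciate the authors' efforts and the writing of the wiki, but the section on custom databases is very short, and there is not even a one-click script, which makes me wonder whether users are deliberately being discouraged from doing this. After all, in other software, such as BLAST, building a custom database takes just a single command.
Second, the wiki does not specify whether the FASTA file needed to build the database is a gene file or a protein file. I looked at the KO database that you built, and the pro_ref.fna in it is a gene file, meaning that it contains ATCG. However, I only have a protein file, which looks like this: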
> AF022812 | AAC60788 | Sulfurospirillum multivorans strain N | 1
MEKKKKPELSRRDFGKLIIGGGAAATIAPFGVPGANAAEKEKNAAEIRQQFAMTAGSPIIVNDKLERYAEVRTAFTHPTSFFKPNYKGEVKPWFLSAYDEKVRQIENGENGPKMKAKNVGEARAGRALEAAGWTLDINYGNIYPNRFFMLWSGETMTNTQLWAPVGLDRRPPDTTDPVELTNYVKFAARMAGADLVGVARLNRNWVYSEAVTIPADVPYEQSLHKEIEKPIVFKDVPLPIETDDELIIPNTCENVIVAGIAMNREMMQTAPNSMACATTAFCYSRMCMFDMWLCQFIRYMGYYAIPSCNGVGQSVAFAVEAGLGQASRMGACITPEFGPNVRLTKVFTNMPLVPDKPIDFGVTEFCETCKKCARECPSKAITEGPRTFEGRSIHNQSGKLQWQNDYNKCLGYWPESGGYCGVCVAVCPFTKGNIWIHDGVEWLIDNTRFLDPLMLGMDDALGYGAKRNITEVWDGKINTYGLDADHFRDTVSFRKDRVKKS
> AY013367 | AAG46194 | Sulfurospirillum halorespirans DSM 13726 | 1
MEKKKKPELSRRDFGKLIIGAGAAATIAPFGVPGANAAEKEKNAAEIRQQFAMTAGSPIIVNDKLERYAQVRTAFTHPTSMFKPNYKGEVKHWFLSSCDEKVRQIENGENGPKMKAKNVGEARAGRALEAAGWTLDXNFGGSFGSYYPNRFSMLWSGETMLNTQMWATVGLDRRPPDTTDPVELTNYVKFAARMAGADLVGVARLNRNWVYSGAVTIPDEQSWHKEIEKPIVFKDVPLPIETDDELIIPNTCDNVIVSGIAMNREMLQTAPTSM

I also do not know the gene copy number corresponding to the 16S. Will a database that I build from this work properly?

Robyn Wright

Mar 26, 2025, 11:11:38 AM
to picrust-users
I have used Google translate, so maybe there are some errors, but this is what Google said that you said, so I will respond to that below:

Thank you for your reply. In fact, please pay attention to my question. First of all, I recognize the efforts of the authors and the writing of the wiki, but the content here for the custom database is very brief, and it is not even a one-click script. I can't help but wonder whether it is deliberately not recommended to users to allow it. You know, in other software, such as blast, building a custom library is just a matter of command.

Secondly, the wiki does not specify whether the fasta file required to build the database is a gene file or a protein file. I checked the ko database you built, and the pro_ref.fna in it is a gene file, which contains ATCG. However, in fact, I only have the protein file, which is like this:

> AF022812 | AAC60788 | Sulfurospirillum multivorans strain N | 1
MEKKKKPELSRRDFGKLIIGGGAAATIAPFGVPGANAAEKEKNAAEIRQQFAMTAGSPIIVNDKLERYAEVRTAFTHPTSFFKPNYKGEVKPWFLSAYDEKVRQIENGENGPKMKAKNVGEARAGRALEAAGWTLDINYGNIYPNRFFMLWSGETMTNTQLWAPVGLDRRPPDTTDPVELTNYVKFAARMAGADLVGVARLNRNWVYSEAVTIPADVPYEQSLHKEIEKPIVFKDVPLPIETDDELIIPNTCENVIVAGIAMNREMMQTAPNSMACATTAFCYSRMCMFDMWLCQFIRYMGYYAIPSCNGVGQSVAFAVEAGLGQASRMGACITPEFGPNVRLTKVFTNMPLVPDKPIDFGVTEFCETCKKCARECPSKAITEGPRTFEGRSIHNQSGKLQWQNDYNKCLGYWPESGGYCGVCVAVCPFTKGNIWIHDGVEWLIDNTRFLDPLMLGMDDALGYGAKRNITEVWDGKINTYGLDADHFRDTVSFRKDRVKKS
> AY013367 | AAG46194 | Sulfurospirillum halorespirans DSM 13726 | 1
MEKKKKPELSRRDFGKLIIGAGAAATIAPFGVPGANAAEKEKNAAEIRQQFAMTAGSPIIVNDKLERYAQVRTAFTHPTSMFKPNYKGEVKHWFLSSCDEKVRQIENGENGPKMKAKNVGEARAGRALEAAGWTLDXNFGGSFGSYYPNRFSMLWSGETMLNTQMWATVGLDRRPPDTTDPVELTNYVKFAARMAGADLVGVARLNRNWVYSGAVTIPDEQSWHKEIEKPIVFKDVPLPIETDDELIIPNTCDNVIVSGIAMNREMLQTAPTSM

I also don't know the gene copy number corresponding to 16S. Can the database I built based on it work properly?

I am paying attention to your question, but your question is not clear. In my first response to you on the PICRUSt2 Github I asked you for some more information on what it is exactly that you're trying to do. I've tried to help without knowing this based on the questions that you are asking me, but as you appear not to like my answers, I assume that I don't have enough information from you currently. So please give me more information specifically on:

1. What is the data that you have that you are trying to build a PICRUSt2 database with? Is it complete genomes?

2. What is the data that you are trying to run through this PICRUSt2 database? Is it amplicon sequencing data?

I need to know exactly what the data you have is, what format it is in, and what you are trying to get from this data. 


PICRUSt2 is used for the prediction of functions within amplicon sequencing data. It requires a reference database of genomes with annotated functions as well as a phylogenetic tree containing marker genes within these genomes. The sequences from your study would then be compared to the marker genes within the genomes (most of the time, it is the 16S rRNA gene that we are using here, although it doesn't have to be the 16S rRNA gene). PICRUSt2 is only designed to work with nucleic acid sequences for both the study sequences and the reference sequences within the tree. When we are comparing the sequences in our study to the genes in our reference tree, we must first make a multiple sequence alignment of these to determine the best fit within the tree, and then the gene content of our study sequences is predicted based on their placement in the tree. You can refer to the flowchart here for a detailed description of all of the steps involved. A tool like BLAST is simply comparing two nucleotide sequences, hence it is easy to make a reference database. PICRUSt2 is carrying out a lot more steps, and the files needed for each of these steps often need to be constructed with different parameters, hence there is not a single command to make a PICRUSt2 database. Constructing a PICRUSt2 database is not a trivial task - in fact, we are currently in the process of publishing an entire paper just on the construction of a new PICRUSt2 database; this would not be necessary if there were a way to simplify this to a single command. 
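
Because the reference and study sequences both need to be nucleic acid, it can help to check a FASTA file before going any further. The following is a minimal heuristic sketch, not part of PICRUSt2: the function name guess_alphabet and the 0.9 threshold are assumptions chosen for illustration.

```python
def guess_alphabet(sequence: str, threshold: float = 0.9) -> str:
    """Guess whether a sequence is nucleotide or protein.

    A sequence is called "nucleotide" when at least `threshold` of its
    letters are A, C, G, T, U or N; gaps and other symbols are ignored.
    """
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        raise ValueError("sequence contains no letters")
    nucleotide_like = sum(c in "ACGTUN" for c in residues)
    if nucleotide_like / len(residues) >= threshold:
        return "nucleotide"
    return "protein"


if __name__ == "__main__":
    # Start of the first protein sequence quoted earlier in the thread.
    print(guess_alphabet("MEKKKKPELSRRDFGKLIIGGGAAATIAPFGVPGANAAEK"))
    # A short made-up nucleotide fragment.
    print(guess_alphabet("ATCGATCGNNAT"))
```

A heuristic like this can be fooled by very short or low-complexity sequences, so it is best treated as a quick sanity check rather than a definitive test.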

I am not stopping users from building a custom database, but building a database does require some bioinformatic knowledge as well as knowledge of the organisms that you want to have in the database, and I would therefore not recommend it to a user performing their first bioinformatic analysis. If there is a particular function that you are interested in then it is much easier to add this into the existing database, and I am currently in the process of writing step-by-step instructions for users to do this. This will still require more bioinformatic skill than just using the existing PICRUSt2 database, but will be easier than constructing an entirely new database. 

TY H

Mar 27, 2025, 8:52:22 AM
to picrust-users
Thank you for your prompt replies, and I apologize for taking up your valuable time.
I forgot to translate my previous message, which was a major mistake; please forgive me.

What I want to do is this: I have my own 16S rRNA data, and I want to see which dehalogenases it has expressed. I therefore want to compare it against the dehalogenase database at https://rdasedb.biozone.utoronto.ca/downloads, because KEGG only has 14 homologous dehalogenases and I only got up to 5 dehalogenases with PICRUSt2 2.6, which obviously doesn't meet my needs.

Now, should I build a database with NT.fasta or with AA.fasta, or should I build a database that links the two, and which process works?

Note: The NT.fasta may not be 16S data.

Robyn Wright

Mar 27, 2025, 10:08:21 AM
to picrust-users
Thanks for clarifying.

As this is a single function and not an entire database of organisms, I'd recommend that you use a custom trait table for this with the existing database. This is actually exactly what I'm working on making a tutorial for at the moment. The tutorial will include details on constructing a Hidden Markov Model for a function of interest, running this against the genomes included in the existing database, and then collating the output of this HMM search to use as a custom trait table. The manuscript for the new database is currently under revision, and I plan to have the tutorial added to the PICRUSt2 wiki within the next week or so.

The steps involved will be:
1. Download the GTDB genomes from their website
2. Filter the GTDB genomes to include only those within the PICRUSt2 database
3. Create an alignment of the sequences for your function of interest
4. Create an HMM with this alignment
5. Run the HMM against the PICRUSt2 genomes
6. Collate the results of the HMM searches to create a custom trait table
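
As an illustration of what step 6 could look like, here is a minimal Python sketch that counts hits in hmmsearch --tblout output for each genome and writes a tab-delimited table. This is not the forthcoming tutorial's method: the 1e-5 E-value cutoff, the "assembly" header name, the genome IDs, and the trait name "RDase" are all assumptions made for the example.

```python
import csv
import io
from typing import Dict, Iterable


def count_hits(tblout_lines: Iterable[str], max_evalue: float = 1e-5) -> int:
    """Count target sequences in hmmsearch --tblout output passing a cutoff."""
    hits = 0
    for line in tblout_lines:
        # Comment lines start with "#"; blank lines carry no hit.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Column 5 (index 4) is the full-sequence E-value.
        if float(fields[4]) <= max_evalue:
            hits += 1
    return hits


def build_trait_table(counts: Dict[str, int], trait_name: str) -> str:
    """Return a tab-delimited table with one row per genome."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["assembly", trait_name])  # header name is an assumption
    for genome in sorted(counts):
        writer.writerow([genome, counts[genome]])
    return buffer.getvalue()


if __name__ == "__main__":
    # Two made-up tblout rows: one strong hit and one weak hit.
    example = [
        "# target name  accession  query name ...",
        "prot1 - RDase - 1e-50 170.2 0.1 2e-50 169.0 0.1 1.0 1 0 0 1 1 1 1 -",
        "prot2 - RDase - 0.5 5.0 0.0 0.6 4.8 0.0 1.0 1 0 0 1 1 0 0 -",
    ]
    counts = {"genome_A": count_hits(example), "genome_B": 0}
    print(build_trait_table(counts, "RDase"), end="")
```

In practice each genome would have its own tblout file, and the cutoff would need to be tuned for the function of interest.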

I will update you when I have made this tutorial, and it will be on this page.

Robyn
