hidden state predictions with custom tree

Sara Beier

unread,

Sep 12, 2024, 2:00:49 PM9/12/24

to picrust-users

Dear Gavin,

I have currently two issues using PICRUSt2

1) I would like to predict 16s gene copy numbers not using your default database, but the rrnDB. This is because my analyses indicated that the RRN values of your references database might be less accurate then those in rrnDB. For this purpose I run the following command:

picrust2_pipeline.py -s test.fasta -i test.asv.tsv --no_pathways --min_samples 0 -o pic.rrnDB -p 32 --marker_gene_table rrnDB_customREF/16S_rrnDB.txt --ref_dir rrnDB_customREF/ --skip_minpath --edge_exponent 0

the output for RRN predictions is created as I want to, but the command anyway ends with the following error message:

Error running this command:
hsp.py --tree /bio/Analysis_data/IOWseq000048_gesifus.field/snakemake_test_20240824/Intermediate_results/metab/pic.rrnDB/tmp/out.tre --output /bio/Analysis_data/IOWseq000048_gesifus.field/snakemake_test_20240824/Intermediate_results/metab/pic.rrnDB/tmp/EC_predicted.tsv.gz --observed_trait_table /bio/Software/anaconda3/envs/picrust2-2.5.1/lib/python3.8/site-packages/picrust2/default_files/prokaryotic/ec.txt.gz --hsp_method mp --edge_exponent 0.0 --seed 100 --processes 32

obviously, the reference file EC_predicted.tsv.gz does not fit to the species in my default reference tree. As I am not interested in EC predictions, I have so far ignored this error. However, I am currently working on integrating the PICRUST2 predictions into a multi-step snakemake pipe, and the snakemake pipline stops due to this error. I hoped to fix this problem by adding the flag --no_pathways, but the error is still produced. I get this error for the picrust versions 2.4.2 and 2.5.1. As I do not have admin rights on our server I currently cannot try with the latest PICRUSt2 version that is not installed. Any idea how to enable the EC prediction in my command to avoid the error message and make the command working within my snakemake pipeline?

2) I am also running further trait predictions via the hsp.py command. This is working well using version 2.3.0_b. Versions 2.4.2 and 2.5.1 produce the same output, but it takes forever. Is it possible that there is a bug with the parallelization of the hsp.py script in newer PICRUST2 versions?

Everything best,

Sara

Robyn Wright

unread,

Dec 16, 2024, 4:40:08 PM12/16/24

to picrust-users

Hi Sara,

Apologies for the slow reply, I must have missed this message earlier. Although I am also much better able to answer this now than I would have been a few months ago.

Also FYI, I have taken over as PICRUSt2 "tech support" - Gavin hasn't been in the Langille lab for a few years now and wanted to move on!

If you are using a completely different tree, then I would recommend running each of the steps separately until you have an EC table/a trait table of some kind. When you are running picrust2_pipeline.py, this is really just a wrapper for a few different scripts:

place_seqs.py - this is the script used for placing study sequences into the reference tree

hsp.py - this script will be run multiple times, first for predicting marker gene copy numbers (and also calculating Nearest Sequenced Taxon Indices) and then for predicting any traits that you ask it to (by default, it will run this for the marker gene, EC numbers and KOs)

metagenome_pipeline.py - this is for converting the predictions from hsp.py that are on an individual sequence basis into metagenome predictions (multiplying copy number and abundance within your feature table with the predicted abundances of traits)

pathway_pipeline.py - this is for inferring the abundance of pathways within your predicted metagenome

If you are only looking to predict 16S copy numbers at this point, you could run only place_seqs.py and hsp.py with the marker gene table.

I am not too familiar with the earlier versions of PICRUSt2 and what the differences are, but the default running of hsp.py within the picrust2_pipeline.py script is to only use 1 thread when running hsp.py with a marker gene table. If you ran both steps separately then you could set this for yourself. I typically find that it runs in a few minutes anyway (even on my laptop), even with a few thousand sequences, but if you have a very large tree then this could be different. Also, the probability of every "hidden state" will be predicted with hsp.py - I found that even when running hsp.py with just a few sequences, if I had a trait in my reference table that had a large number of copies, it would take a really long time to run. It could be something like that for you?

I'll also note that I've just finished making a new database that is based on GTDB genomes and Eggnog annotations. I don't know if this is useful for your purposes, but I did find that I needed to re-identify the 16S copies for myself within the genomes as what GTDB provided didn't seem to work out (some that weren't matching bacteria within bacterial genomes, some way too short, etc), so I used barrnap with bacterial or archaeal models to identify them. This is not my specialty, so I don't know if this is rigorous enough for your purposes or not, but if you are interested in trying out the new database then you can find some details on it here. If you do use it, I'd really appreciate it if you'd let me know how it goes.

Thanks,

Robyn

Sara Beier

unread,

Dec 27, 2024, 10:24:28 AM12/27/24

to picrust-users

Hi Robyn,

thanks for your reply. Indeed, running the individual scripts place_seqs.py and hsp.py with the marker gene table solves the problem of the error message.

I am still having the problem that the hsp.py in the newer PICRUSt versions runs (differently then in PIRRUSt 2.3.0_b) for extended times. I believe that there might be a bug with the parallelization in the newer PIRRUSt versions.

I did not yet have time testing your new database, but I will try it out at some point and then let you know.

Best,

Sara

Robyn Wright

unread,

Jan 6, 2025, 9:09:15 AMJan 6

to picrust-users

I'm glad that's solved! Could you try running both with the --verbose option, to see if there is a difference in what they are running? Also, do you get the same results for running with both versions?

Thanks,

Robyn

Sara Beier

unread,

Jan 22, 2025, 8:54:33 AMJan 22

to picrust-users

I meanwhile run the code with the --verbose option (picrust2-2.3.0_b versus picrust-2.5.0 ). The predicted values are identical and the metadata_NSTI differ by the number of digits. However, while it takes about 20 min for picrust2-2.3.0_b, it takes >24h for picrust-2.5.0. Please see below the detailed code and STDOUT

best regards,

Sara

```
# picrust2-2.3.0_b
## code
hsp.py --observed_trait_table /bio/Databases/GenomicTraitsPrediction_PICRUST2/p.HGT_perc.txt -t out.tre -o pic.HGT_perc_predicted_t1 -n -p 20 --verbose

## STDOUT
Rscript /bio/Software/anaconda3/envs/picrust2-2.3.0_b/lib/python3.6/site-packages/picrust2/Rscripts/castor_nsti.R out.tre /tmp/tmp17gpp93l/known_tips.txt /tmp/tmp17gpp93l/nsti_out.txt
Warning message:
package ‘Rcpp’ was built under R version 3.6.3

Rscript /bio/Software/anaconda3/envs/picrust2-2.3.0_b/lib/python3.6/site-packages/picrust2/Rscripts/castor_hsp.R out.tre /tmp/tmpyggkfnks/subset_tab_0 mp FALSE FALSE /tmp/tmprvt9aca8/predicted_counts.txt /tmp/tmprvt9aca8/predicted_ci.txt 100
Warning message:
package ‘Rcpp’ was built under R version 3.6.3

# picrust-2.5.0
## code
hsp.py --observed_trait_table /bio/Databases/GenomicTraitsPrediction_PICRUST2/p.HGT_perc.txt -t out.tre -o pic.HGT_perc_predicted_t2 -n -p 20 --verbose

## STDOUT
Rscript /bio/Software/anaconda3/envs/picrust-2.5.0/lib/python3.8/site-packages/picrust2/Rscripts/castor_nsti.R out.tre /tmp/tmpx829zz4q/known_tips.txt /tmp/tmpx829zz4q/nsti_out.txt

Rscript /bio/Software/anaconda3/envs/picrust-2.5.0/lib/python3.8/site-packages/picrust2/Rscripts/castor_hsp.R out.tre /tmp/tmpvrnk2ljx/subset_tab_0 mp 0.5 FALSE FALSE /tmp/tmpfqmrhth7/predicted_counts.txt /tmp/tmpfqmrhth7/predicted_ci.txt 100

```

Reply all

Reply to author

Forward