Assistance required converting Mothur-generated BIOM File for PICRUSt2 analysis


Jay

Feb 18, 2025, 8:28:34 AM
to picrust-users

Dear PICRUSt2 development team,

I am working on a project to predict metagenomic functions with the PICRUSt2 pipeline, following the "Workflow PICRUSt2‐MPGA database" page of the picrust/picrust2 Wiki. I process 16S rRNA gene sequences with Mothur (version 1.48.0) and then use PICRUSt2 for functional inference. However, I have not been able to generate a BIOM file that is compatible with PICRUSt2.

Workflow details

  1. Sequence processing with Mothur

    • Clustering and taxonomy assignment: I processed my sequences using Mothur's standard operating procedures, utilizing the SILVA 132 reference database (97% similarity) for alignment and classification of the V3_V4 region.
    • BIOM file generation: To create the BIOM file, I executed the following Mothur command:
      make.biom(shared=AlgOpti1.trim.contigs.renamed.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.opti_mcc.shared, constaxonomy=AlgOpti1.trim.contigs.renamed.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.opti_mcc.0.03.cons.taxonomy)
  2. Validation of the BIOM file

    • After generating the BIOM file (AlgOpti1.trim.contigs.renamed.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.opti_mcc.0.03.biom), I attempted to validate it using the biom tool:


      biom validate-table -i AlgOpti1.trim.contigs.renamed.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.opti_mcc.0.03.biom

    • The validation returned the following errors:

      Unknown table type, however that is likely okay.
      Timestamp does not appear to be ISO 8601
      Missing required 'sample/matrix' group
      Missing required 'sample/matrix/data' dataset
      Missing required 'sample/matrix/indices' dataset
      Missing required 'sample/matrix/indptr' dataset
      Number of sample IDs is not equal to the described shape
      Table indicates it is version 2.0
      The input file is not a valid BIOM-formatted file.

I have installed the BIOM CLI tools via conda:   conda install -c bioconda biom-format

Challenges faced

  • BIOM file compatibility: The BIOM file generated by Mothur does not conform to the expected format required by PICRUSt2. The validation errors suggest structural issues within the file.
  • Taxonomy reference: I utilized the SILVA 132 database for sequence classification. I am aware that PICRUSt2 is optimized for Greengenes taxonomy, and this discrepancy might contribute to compatibility issues.
  • Shared file format: Does the AlgOpti1.trim.contigs.renamed.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.opti_mcc.shared file need to be reshaped into a .txt file with a corrected header?

I appreciate your time and assistance in resolving these issues. 

Kind regards.

Sincerely
Jay

picrust2_environment.yml

Robyn Wright

Feb 18, 2025, 8:42:33 AM
to picrust-users
Hi Jay,

I'm actually not familiar with the biom validate-table command, but you don't mention running into a PICRUSt2 error, so have you tried just running PICRUSt2 on the BIOM table that you've generated? If you have and it fails, it might be simpler to convert your Mothur table to a tab-delimited file. It is often easier to see what is going wrong when you can open the table on its own. If the tab-delimited table doesn't work with PICRUSt2 either, perhaps you could share it along with your fasta file and I can look at why they might not be working?
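
The conversion to a tab-delimited file suggested above could be sketched as follows. This is a minimal sketch, not a tested Mothur or PICRUSt2 recipe. It assumes the .shared file uses the usual Mothur layout (a header of label, Group, numOtus, then one column per OTU, with one row per sample) and transposes it so that OTUs are rows and samples are columns. The function names, the "#OTU ID" header cell and the example values are illustrative choices, not anything taken from the thread.

```python
import csv
import io


def shared_to_otu_rows(shared_text, label=None):
    """Transpose Mothur .shared text into rows of [otu_id, count per sample...].

    Assumes the first three columns are label, Group and numOtus.
    If `label` is given (e.g. "0.03"), only rows with that label are kept.
    """
    reader = csv.reader(io.StringIO(shared_text), delimiter="\t")
    header = next(reader)
    # Drop empty cells in case the header line ends with a trailing tab.
    otu_ids = [name for name in header[3:] if name]

    samples, counts = [], []
    for row in reader:
        if not row or (label is not None and row[0] != label):
            continue
        samples.append(row[1])                      # Group column = sample name
        counts.append(row[3:3 + len(otu_ids)])      # one count per OTU

    table = [["#OTU ID"] + samples]                 # header cell name is an assumption
    for i, otu in enumerate(otu_ids):
        table.append([otu] + [sample_counts[i] for sample_counts in counts])
    return table


def write_tsv(table, path):
    """Write the transposed table as a tab-delimited text file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(table)


# Tiny made-up example: two samples, two OTUs.
demo = (
    "label\tGroup\tnumOtus\tOtu001\tOtu002\n"
    "0.03\tS1\t2\t5\t0\n"
    "0.03\tS2\t2\t1\t7\n"
)
demo_table = shared_to_otu_rows(demo, label="0.03")
```

For a real file, the text would be read with open(path).read() and the result passed to write_tsv. The OTU names in the first column still have to match the names in the representative-sequence FASTA file.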

Also, you don't need any taxonomic assignments for PICRUSt2. PICRUSt2 is expecting that you have: (1) a file containing representative sequences for all of your ASVs/OTUs, and (2) a table containing the abundance of those sequences (rows) in your samples (columns). The sequence names should be the same between (1) and (2). PICRUSt2 then places the sequences into the phylogenetic tree (PICRUSt1 used Greengenes taxonomy assignments, but this isn't the case in PICRUSt2) and determines which sequences in the tree are closest for generating predictions. The latest version will also tell you which sequence is closest, but this has not been validated as a method for taxonomic assignment.
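
The requirement that sequence names agree between (1) and (2) can be checked before running PICRUSt2. The sketch below is an illustration under stated assumptions rather than part of PICRUSt2: it takes the first whitespace-delimited token of each FASTA header as the sequence name, treats the first non-empty line of the tab-delimited table as the header row, and reports names that appear in only one of the two files. All function names and example values are hypothetical.

```python
def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of every FASTA header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids


def table_ids(tsv_text):
    """Return the first column of a tab-delimited table, skipping the header row."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    return [line.split("\t")[0] for line in lines[1:]]


def compare_ids(fasta_text, tsv_text):
    """Report names present in only one of the two inputs."""
    seqs = set(fasta_ids(fasta_text))
    rows = set(table_ids(tsv_text))
    return {
        "only_in_fasta": sorted(seqs - rows),
        "only_in_table": sorted(rows - seqs),
    }


# Made-up example: Otu003 has a sequence but no table row,
# and Otu002 has a table row but no sequence.
demo_fasta = ">Otu001 some description\nACGT\n>Otu003\nGGCC\n"
demo_tsv = "#OTU ID\tS1\tS2\nOtu001\t5\t1\nOtu002\t0\t7\n"
mismatches = compare_ids(demo_fasta, demo_tsv)
```

If both lists come back empty, the names agree. Any entries point to headers that would need renaming in one file or the other.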

Robyn
