Running PICRUSt2 for a large dataset


Chieh-Chang Chen

Nov 11, 2025, 7:03:02 PM
to picrust-users
I'm currently running a large dataset through PICRUSt2, but the process has been extremely slow: it has been running for over two weeks with minimal progress.
The input dataset was filtered to exclude ASVs present in fewer than 0.5% of samples and with fewer than 100 total reads. After filtering, the dataset includes:
- Samples: 4,630
- Features (ASVs): 14,246
- Total read count: 461,615,396
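For context, the prevalence/abundance filter described above can be sketched in pandas. This is a hypothetical illustration, not the exact command used on the dataset; the table, sample names, and thresholds are placeholders, and a real feature table would typically be loaded from a BIOM file.

```python
import pandas as pd

# Toy feature table (rows = ASVs, columns = samples); names are illustrative.
table = pd.DataFrame(
    {
        "S1": [50, 0, 30],
        "S2": [60, 0, 30],
        "S3": [0, 0, 30],
        "S4": [0, 2, 30],
    },
    index=["ASV1", "ASV2", "ASV3"],
)

min_prevalence = 0.005  # present in at least 0.5% of samples
min_total_reads = 100   # at least 100 reads summed across all samples

prevalence = (table > 0).mean(axis=1)  # fraction of samples containing each ASV
total_reads = table.sum(axis=1)
filtered = table[(prevalence >= min_prevalence) & (total_reads >= min_total_reads)]
print(list(filtered.index))  # ASV2 is dropped for having only 2 total reads
```

Filtering like this shrinks the feature table before PICRUSt2 sees it, which is usually the single biggest lever on its runtime and memory use.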
Does anyone have suggestions for optimizing performance or troubleshooting this issue?

The PICRUSt2 version is 2.6.2, and the memory usage is currently around 300 GB.
There is a warning:
metagenome_pipeline.py:317: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  func_abun_subset['taxon'] = func_abun_subset.index.to_list()
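For what it's worth, that PerformanceWarning is advisory rather than fatal: it flags a pandas anti-pattern in which columns are inserted one at a time, fragmenting the DataFrame's internal blocks. Below is a generic sketch of the pattern and the fix the warning itself suggests (`pd.concat(axis=1)`); it is not PICRUSt2's actual code, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

n_taxa, n_samples = 1000, 50
columns = {f"sample_{i}": np.full(n_taxa, float(i)) for i in range(n_samples)}

# Fragmented pattern (what triggers the PerformanceWarning):
#   frame = pd.DataFrame(index=range(n_taxa))
#   for name, values in columns.items():
#       frame[name] = values   # each assignment inserts a new block
#
# De-fragmented alternative: assemble all columns in a single concat call.
frame = pd.concat(
    {name: pd.Series(values) for name, values in columns.items()}, axis=1
)
```

The concat version builds the frame's memory layout once instead of reallocating on every insertion, which is why pandas recommends it for wide tables like per-sample function abundances.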

Thanks!

Robyn Wright

Nov 26, 2025, 8:42:43 AM
to picrust-users
Hi there,

Sorry for the delay.

That does seem to be quite a lot of ASVs, as well as quite a lot of samples, although it is not too dissimilar from a dataset that I've run myself recently (~7,000 ASVs and 4,800 samples). I didn't check the memory usage for that run, but I used 24 threads and it apparently took 7,081 seconds (~2 hours) on our server, which has 1.5 TB of RAM. I would not expect the total read count to make much difference, as it is only used when generating the metagenome abundances, but mine had ~110M total reads.

How many threads were you using for this? Did it end up finishing? 

Best wishes,
Robyn