rarefy guppy command

Michael Doane

Oct 12, 2017, 1:49:03 PM
to pplacer users
Hey all,

I'm thinking about rarefying my samples for downstream analysis, but I'm unsure how the 'rarefy' command operates in guppy. Is there a recommended method for determining what depth to rarefy each sample down to? I assume it would be best to find the smallest sample and rarefy all samples to that size. However, I'm unsure how to determine the smallest. Do I count the number of candidate sequences in my set of files, or the number of placements (including pqueries and multiplicities)?

Thanks in advance.

Erick Matsen

Oct 12, 2017, 4:34:48 PM
to pplacer users
This is up to the user. Yes, downsampling to the smallest number of unique sequences is common.

--
Frederick "Erick" Matsen, Associate Member
Fred Hutchinson Cancer Research Center
http://matsen.fredhutch.org/

Toni Govednik

Dec 30, 2025, 8:25:34 AM
to pplacer users
Hi,

First of all, thanks for the nice tools for dealing with phylogenetic placement data. I have a question about rarefaction of jplace files. I was trying to rarefy all my jplace files to the smallest number of pqueries, which was 113 in some cases and higher (~500 or ~700) in others. The problem I encountered is that the number of reads in the rarefied jplace files is lower than the selected number, and that the different jplace files differ among themselves. What am I doing wrong?

Before rarefaction:

gappa examine info --jplace-path .

Found 18 jplace files

Sample          Branches      Leaves    Pqueries
X10_Clade_I         3286        1644         173
X11_Clade_I         3286        1644         181
X12_Clade_I         3286        1644         161
X1_Clade_I          3286        1644         180
X3_Clade_I          3286        1644         153
X5_Clade_I          3286        1644         137
X7_Clade_I          3286        1644         169
X8_Clade_I          3286        1644         140
X9_Clade_I          3286        1644         153
T10_Clade_I         3286        1644         150
T11_Clade_I         3286        1644         133
T12_Clade_I         3286        1644         113
T1_Clade_I          3286        1644         181
T3_Clade_I          3286        1644         133
T5_Clade_I          3286        1644         114
T7_Clade_I          3286        1644         124
T8_Clade_I          3286        1644         143
T9_Clade_I          3286        1644         140

Rarefaction:

# rarefying files
guppy rarefy "$file" \
    --out-dir "$EXTDIR" \
    -o "${name}.jplace" \
    -n 113

After rarefaction:

Found 18 jplace files

Sample          Branches      Leaves    Pqueries
N10_Clade_I         3286        1644          81
N11_Clade_I         3286        1644          81
N12_Clade_I         3286        1644          82
N1_Clade_I          3286        1644          80
N3_Clade_I          3286        1644          78
N5_Clade_I          3286        1644          73
N7_Clade_I          3286        1644          81
N8_Clade_I          3286        1644          78
N9_Clade_I          3286        1644          77
T10_Clade_I         3286        1644          77
T11_Clade_I         3286        1644          74
T12_Clade_I         3286        1644          72
T1_Clade_I          3286        1644          79
T3_Clade_I          3286        1644          75
T5_Clade_I          3286        1644          71
T7_Clade_I          3286        1644          73
T8_Clade_I          3286        1644          75
T9_Clade_I          3286        1644          75

Thank you in advance!

Erick

Jan 2, 2026, 12:04:17 PM
to pplacer users
Could you please provide some smallish example files demonstrating the behavior?

Toni Govednik

Jan 2, 2026, 12:32:32 PM
to pplacer users
Hi Erick,

Happy new year!
Yes, for sure. I'm attaching three jplace files and their respective rarefied files. The rarefaction depth used for all 18 samples (of which I'm attaching three) was N=307, based on sample T12.

The relevant part of the script (from the larger workflow) is as follows:

for file in "$JPLACE_DIR"/*.jplace; do
    name=$(basename "$file" .jplace)
    echo "   - $name"

    guppy rarefy "$file" \
        --out-dir "$RARE_DIR" \
        -o "${name}.jplace" \
        -n 307
done

I really appreciate your help!
Toni

Erick Matsen

Jan 3, 2026, 7:11:09 AM
to pplace...@googlegroups.com
Hi Toni,

Thanks for the example files - they helped track down the issue.

The short answer: guppy rarefy -n 307 samples 307 reads, not 307 pqueries. Because multiple sampled reads can belong to the same pquery (and some pqueries won't be sampled at all), the output will have fewer than 307 distinct pqueries. This is expected behavior for rarefaction.

This matches standard ecological rarefaction, which normalizes sequencing depth (read count), not species richness (pquery count).
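
To make the distinction concrete, here is a minimal Python sketch of read-level rarefaction. It is an illustration only and not guppy's implementation: the function name `rarefy_reads`, the pquery labels, and the multiplicities are all made up, and it assumes sampling without replacement, which the discussion here does not confirm for guppy. It shows that drawing n reads typically touches fewer than n distinct pqueries.

```python
import random
from collections import Counter


def rarefy_reads(multiplicities, n, seed=0):
    """Draw n reads without replacement; return the reads kept per pquery.

    `multiplicities` maps a pquery label to the number of reads it represents.
    This is a hypothetical helper for illustration, not part of guppy.
    """
    # Expand each pquery into one entry per read it represents.
    reads = [pq for pq, m in multiplicities.items() for _ in range(m)]
    if n > len(reads):
        raise ValueError("cannot sample more reads than the sample contains")
    rng = random.Random(seed)
    # Count how many of the sampled reads fall into each pquery.
    return Counter(rng.sample(reads, n))


# Made-up sample: 5 pqueries representing 12 reads in total.
sample = {"pq_a": 6, "pq_b": 3, "pq_c": 1, "pq_d": 1, "pq_e": 1}
kept = rarefy_reads(sample, n=5)
print(sum(kept.values()), "reads kept across", len(kept), "pqueries")
```

The number of reads kept always equals n, while the number of pqueries in the output depends on how the sampled reads are distributed, so it can vary from sample to sample.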

The documentation is misleading - the help text says "number of pqueries to keep" when it should say "number of reads to sample." 

I will fix this.


Thanks,

Erick

Toni Govednik

Jan 4, 2026, 4:35:59 AM
to pplacer users
Hi Erick,

Thanks a lot for this explanation, it makes perfect sense. I guess I will need to obtain the lowest read count and run the rarefaction again with that number. To get the total number of placements for each read, I'll probably need to run the adcl command and then sum all the counts for the individual samples? Is there a more direct way to do this, for example using the examine info command?

Thanks a lot in advance!

Toni

Toni Govednik

Jan 4, 2026, 4:52:20 AM
to pplacer users
I already found that guppy info gives me all the information I need. Thanks nevertheless!
Toni

Erick Matsen

Jan 4, 2026, 7:26:05 AM
to pplace...@googlegroups.com
I'm not sure if I understand. Why do you want the total number of placements? Don't you want just the number of reads, which you have?

Toni Govednik

Jan 4, 2026, 9:46:45 AM
to pplacer users
I probably didn't express myself correctly. What I meant is the total number of reads per sample, which I equated with placements; however, these are not the same, as a read can be placed multiple times. Anyway, thanks a lot for all the help!
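
For anyone who wants to see the pquery/read difference directly in a file, here is a rough Python sketch that counts both from a parsed jplace document. It relies on the jplace convention that each entry in "placements" carries either an "n" field (sequence names) or an "nm" field (name and multiplicity pairs). The function name and the tiny example document are made up for illustration, and this is not how guppy info is implemented.

```python
import json


def count_pqueries_and_reads(jplace):
    """Return (number of pqueries, total reads) for a parsed jplace document."""
    pqueries = jplace["placements"]
    reads = 0
    for pq in pqueries:
        if "nm" in pq:
            # "nm" holds [name, multiplicity] pairs.
            reads += sum(mult for _name, mult in pq["nm"])
        else:
            # "n" holds names only (a string or a list); each name is one read.
            names = pq["n"]
            reads += 1 if isinstance(names, str) else len(names)
    return len(pqueries), reads


# Made-up miniature jplace content: 2 pqueries representing 5 reads.
example = json.loads("""{"placements": [
  {"p": [], "nm": [["read1", 3], ["read2", 1]]},
  {"p": [], "n": ["read3"]}
]}""")
print(count_pqueries_and_reads(example))  # (2, 5)
```

For a real file, the parsed document would come from `json.load` on the opened jplace file; the smallest read total across samples is then the depth to pass to -n.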