MetaPhlAn number of reads


Cesar Alejandro Perez Fernandez

Aug 16, 2017, 9:13:26 PM
to MetaPhlAn-users
Hi!

I'm new to MetaPhlAn and I have a few questions about the results.

I noticed that the options -t (type of analysis) and --stat (normalization) are meant for calculating and normalizing the relative abundance of microorganisms. I would like to know how to obtain the total abundance, or read counts, of each taxon without normalization.

I'm grateful for the help!!
Cesar

Nicola Segata

Aug 17, 2017, 7:55:03 AM
to Cesar Alejandro Perez Fernandez, MetaPhlAn-users
Hi Cesar,
Because of how MetaPhlAn works, it is not possible to obtain read counts. You can, however, get "pseudo" read counts by multiplying the relative abundances by a constant and rounding to the nearest integer.

best
Nicola
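
A minimal sketch of the pseudo-count idea described above, assuming the relative abundances are given as percentages. The function name `pseudo_counts`, the example taxa, and the default constant are hypothetical and chosen only for illustration. The constant is arbitrary, so the results are not real read counts.

```python
def pseudo_counts(rel_abundances, scale=1_000_000):
    """Turn relative abundances (percent) into integer pseudo-counts.

    `scale` is an arbitrary user-chosen constant (an assumption of this
    sketch), so the output is NOT a set of real read counts.
    """
    return {
        taxon: round(pct / 100.0 * scale)
        for taxon, pct in rel_abundances.items()
    }


# Hypothetical species-level abundances in percent (not real data).
example = {"s__Species_A": 62.5, "s__Species_B": 37.25, "s__Species_C": 0.25}
counts = pseudo_counts(example)
for taxon, n in counts.items():
    print(taxon, n)
```

Note that the pseudo-counts scale linearly with the constant, so any downstream statistic that depends on count magnitude will depend on that choice.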


Cesar Alejandro Perez Fernandez

Aug 17, 2017, 10:57:43 AM
to MetaPhlAn-users, capf...@gmail.com, nicola...@unitn.it
Dear Nicola,

Thanks for the response.

I have an additional question about the parameter --min_cu_len, whose default is 2000. Does it mean that every marker of a clade must be at least 2000 bp long for the clade to be considered? If I choose a lower value, e.g. 1000, will it report twice as many clades? What should I take into account when setting this parameter?

Kind regards,
Cesar

Nicola Segata

Aug 17, 2017, 11:00:47 AM
to Cesar Alejandro Perez Fernandez, MetaPhlAn-users
Hi Cesar,
The 2000 refers to the sum of the lengths of all the markers of a clade. I would not recommend lowering it...

thanks
Nicola

Cesar Alejandro Perez Fernandez

Aug 17, 2017, 3:22:38 PM
to MetaPhlAn-users, capf...@gmail.com, nicola...@unitn.it
One last question.

If the relative abundance of a taxon is close to 0, does MetaPhlAn still report it? I am asking about the least abundant taxa, whose relative abundances are almost 0.

Nicola Segata

Aug 18, 2017, 2:47:05 AM
to Cesar Alejandro Perez Fernandez, MetaPhlAn-users
Yes, MetaPhlAn will report the abundance of any taxon as long as at least 10% of its markers have a non-zero abundance.
thanks
Nicola
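
A small sketch of the reporting rule as stated in the reply above, read literally: a taxon is reported when at least 10% of its markers have a non-zero abundance. The function name `is_reported` and the example values are hypothetical, and the actual MetaPhlAn implementation may apply further conditions that are not modeled here.

```python
def is_reported(marker_abundances, min_fraction=0.10):
    """Return True if at least `min_fraction` of the markers are non-zero.

    This follows the rule as worded in the thread; it is an illustration,
    not a copy of MetaPhlAn's internal logic.
    """
    if not marker_abundances:
        return False
    nonzero = sum(1 for a in marker_abundances if a > 0)
    return nonzero / len(marker_abundances) >= min_fraction


# Hypothetical marker abundances for two taxa (not real data).
rare_but_reported = [0.0] * 9 + [0.5]   # 1 of 10 markers is non-zero
too_sparse = [0.0] * 19 + [0.5]         # 1 of 20 markers is non-zero
print(is_reported(rare_but_reported), is_reported(too_sparse))
```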


Adam Retchless

Apr 19, 2018, 2:41:55 PM
to MetaPhlAn-users
Hi Nicola,

This question of the "count equivalent" has been nagging me, and I'm glad to see you've been thinking about it. Do you have any suggestion for how to identify a proper scaling to use? My first thought is that the relative abundances should be multiplied by the number of reads that were used by metaphlan in order to get an "effective number of observations". But this may be on the high side.

On a related note, is there any way to calculate the lower limit of detection for a species? My understanding is that it would vary by species...but perhaps not by much.

Thanks very much for maintaining this program and providing support!

Regards
Adam

Michael McLaren

May 3, 2018, 12:48:36 PM
to MetaPhlAn-users
I'm facing this problem as well. With amplicon data, it is simple enough to model the observed reads per amplicon sequence variant as multinomial (conditional on the total number of reads). It seems to me that an equivalent model for MetaPhlAn's species abundance estimates would be as follows. If x_i is the frequency of species i and l_i is the total length of the markers for species i, then x_i*l_i gives the relative probability of a read mapping to species i. We could then model the number of reads per species as multinomial, with probabilities x_i*l_i / (sum_j x_j*l_j) and a total observed read count equal to the total number of reads mapped to all species markers. We could then use the observed read counts per species and the marker lengths l_i to estimate the uncertainty in the estimated abundances of low-frequency species. I'm not sure what effect the 2000 nt threshold for calling a species would have on this model, though.
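
A rough sketch of the multinomial model proposed above, using only the Python standard library. All function names and numbers are hypothetical. It computes the read-mapping probabilities x_i*l_i / (sum_j x_j*l_j), inverts them to recover frequencies from per-species read counts, and uses the binomial marginal of the multinomial as one possible measure of uncertainty. It ignores the 2000 nt threshold mentioned above.

```python
import math


def read_probabilities(freqs, marker_lengths):
    """p_i = x_i * l_i / sum_j(x_j * l_j) for each species i."""
    weights = {sp: freqs[sp] * marker_lengths[sp] for sp in freqs}
    total = sum(weights.values())
    return {sp: w / total for sp, w in weights.items()}


def frequencies_from_counts(counts, marker_lengths):
    """Invert the model: x_i is proportional to n_i / l_i."""
    per_length = {sp: counts[sp] / marker_lengths[sp] for sp in counts}
    total = sum(per_length.values())
    return {sp: v / total for sp, v in per_length.items()}


def count_std(p, n_total):
    """Standard deviation of one species' read count (binomial marginal)."""
    return math.sqrt(n_total * p * (1.0 - p))


# Hypothetical frequencies and total marker lengths in nt (not real data).
freqs = {"A": 0.9, "B": 0.1}
lengths = {"A": 2000, "B": 6000}
n_mapped = 1000  # total reads mapped to all species markers (assumed)

probs = read_probabilities(freqs, lengths)
expected = {sp: n_mapped * p for sp, p in probs.items()}
recovered = frequencies_from_counts(expected, lengths)
print(probs, expected, recovered, count_std(probs["B"], n_mapped))
```

In this toy case species B receives a quarter of the mapped reads despite a frequency of 0.1, because its markers are three times longer, which is why the marker lengths have to enter the model.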

Nicola Segata

May 4, 2018, 10:54:43 AM
to Michael McLaren, MetaPhlAn-users
Dear Michael and Adam,
If I understood correctly, you are looking for a way to get count data out of MetaPhlAn2 output. This is not straightforward because MetaPhlAn2 uses only a few markers, not whole genomes, as its reference.

There are, however, two ways to get an estimate:
- Multiply the total number of reads by the relative abundance of each species. This works well, but you are likely to overestimate the number of reads for each species, because you cannot count the reads that would not map against any reference genome.
- Use "-t rel_ab_w_read_stats", which estimates the number of reads that should come from a given species by considering the coverage of the species' markers and the length of its genome (taken from reference genomes).

I hope this helps
Nicola
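
The two estimates described above can be sketched roughly as follows. Both function names and all numbers are hypothetical. The second function assumes the standard relationship coverage = reads x read length / genome length; that is an assumption of this sketch and not necessarily the exact formula behind "rel_ab_w_read_stats".

```python
def reads_from_total(total_reads, rel_abundance_pct):
    """Estimate 1: total reads times relative abundance (in percent).

    Likely an overestimate, since reads that map to no reference
    genome are still included in `total_reads`.
    """
    return round(total_reads * rel_abundance_pct / 100.0)


def reads_from_coverage(coverage, genome_length, read_length):
    """Estimate 2 (assumed formula): reads = coverage * genome / read length."""
    return round(coverage * genome_length / read_length)


# Hypothetical sample values (not real data).
est_total = reads_from_total(total_reads=2_000_000, rel_abundance_pct=25.0)
est_cov = reads_from_coverage(coverage=5.0, genome_length=4_000_000, read_length=100)
print(est_total, est_cov)
```

With these made-up inputs the first estimate is larger than the second, which is the direction of the bias described in the first bullet.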


Michael McLaren

May 4, 2018, 1:33:00 PM
to MetaPhlAn-users
Hi Nicola,
Thanks for your reply. Speaking for myself, I am after a way to get the uncertainty in the relative abundance estimates. The total number of reads seems irrelevant for the uncertainty; what I want is the number of reads that could have mapped to a MetaPhlAn species-level marker, and the number of reads that actually did map to a species-level marker. I was thinking that I could get this from the numbers used for the "coverage" calculation in the "rel_ab_w_read_stats" output, which I gather is (# mapped reads) / (target length). Is there any way to get these numbers?

Adam Retchless

Jun 27, 2018, 2:14:03 PM
to MetaPhlAn-users
Thanks for the feedback!