Normalized clade counts


T Alt

May 1, 2015, 6:00:20 PM
to metaphl...@googlegroups.com
Hello to Nicola and the MetaPhlAn team!

I am interested in getting normalized clade counts out of MPA. To use the phrasing of the paper, I would like the "normalized cell counts" per clade.

"To estimate each clade's relative abundance in terms of cell counts (that is, whole-genome counts, rather than single-read counts), the MetaPhlAn classifier normalizes the total number of reads in each clade by the nucleotide length of its markers."

To the best of my knowledge, MPA will generate a compositional normalization (i.e., compute proportions) for clades, but only provides normalized abundances on a per-marker basis in the 'clade_profiles' output. Unfortunately neither is exactly what I need for my work, as comparing proportions or compositional data is statistically problematic.

Any advice on how to output normalized abundances per clade would be greatly appreciated! Let me know if I am not clear.

Best regards,

~Tomer Altman


Nicola Segata

May 3, 2015, 4:14:48 PM
to T Alt, metaphl...@googlegroups.com
Hi Tomer,
 what we mean by "normalized clade counts" is the standard output of MetaPhlAn. "Normalized" here refers to the per-sample normalization, which results in proportional data. What we wanted to stress is that the relative abundances are proportional to the actual number of cells rather than to the fraction of reads coming from each genome. Note that these two quantities can differ substantially for organisms with large differences in genome length.
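
The difference between read fractions and cell proportions can be illustrated with a minimal Python sketch. It is not MetaPhlAn code: the organism names, genome lengths, and cell counts are made up, and it assumes reads are sampled uniformly along each genome.

```python
# Hypothetical example: two organisms present at equal cell counts,
# but with very different genome lengths (all values are made up).
genome_length = {"org_A": 2_000_000, "org_B": 8_000_000}  # base pairs
cells = {"org_A": 100, "org_B": 100}

# Under uniform shotgun sampling, the expected number of reads from an
# organism is proportional to (cells * genome length).
read_mass = {name: cells[name] * genome_length[name] for name in cells}
total_mass = sum(read_mass.values())
read_fraction = {name: mass / total_mass for name, mass in read_mass.items()}

# Dividing by length before renormalizing recovers the cell proportions.
per_length = {name: read_mass[name] / genome_length[name] for name in cells}
total_per_length = sum(per_length.values())
cell_fraction = {name: v / total_per_length for name, v in per_length.items()}

print("fraction of reads:", read_fraction)
print("fraction of cells:", cell_fraction)
```

In this toy case the organism with the longer genome contributes most of the reads even though both organisms have the same number of cells, which is why the length normalization matters.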

I hope this helps
cheers
Nicola


T Alt

May 3, 2015, 4:48:17 PM
to metaphl...@googlegroups.com, tal...@gmail.com, nicola...@unitn.it
Hi Nicola,

Thank you for your reply, and for the clarification.

I see that the paper describes it as:

"Relative abundances are estimated by weighting read counts assigned using the direct method with the total nucleotide size of all the markers in the clade and normalizing by the sum of all directly estimated weighted read counts."

What I want is the quotient of all read counts for a clade, and the total length of all markers for the clade:

# clade hits / \sum clade markers

In other words, I don't want the proportional normalization performed, as that causes statistical problems.
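
As a rough illustration of the quantity I mean, here is a minimal Python sketch. The function name, the input dictionaries, and the numbers are all hypothetical; this is not how MetaPhlAn stores its data, only the arithmetic I am after.

```python
from collections import defaultdict


def clade_normalized_counts(marker_hits, marker_length, marker_to_clade):
    """Return reads per marker nucleotide for each clade.

    Hypothetical helper: sums the hits and the marker lengths per clade and
    takes their quotient, with no rescaling to per-sample proportions.
    """
    hits = defaultdict(int)
    length = defaultdict(int)
    for marker, clade in marker_to_clade.items():
        hits[clade] += marker_hits.get(marker, 0)  # markers with no hits count as 0
        length[clade] += marker_length[marker]
    return {clade: hits[clade] / length[clade] for clade in length}


# Made-up inputs: three markers belonging to two clades.
marker_to_clade = {"m1": "s__A", "m2": "s__A", "m3": "s__B"}
marker_length = {"m1": 1000, "m2": 500, "m3": 2000}  # nucleotides
marker_hits = {"m1": 30, "m2": 15, "m3": 10}  # reads mapped to each marker

counts = clade_normalized_counts(marker_hits, marker_length, marker_to_clade)
print(counts)
```

The values do not sum to one within a sample, so they are not compositional in the way the relative abundances are.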

Is there a way to get MetaPhlAn to do this?

Thanks,

~Tomer

T Alt

May 6, 2015, 12:28:56 AM
to metaphl...@googlegroups.com
Hi Nicola,

I know that you run this mailing list as a service to the research community, so I don't want to pester you. Can you let me know what's a typical turn-around time for replying? That way I won't bother you and I won't have uncertainty about whether a question is dropped.

Thanks,

~Tomer

Nicola Segata

May 6, 2015, 2:57:33 AM
to metaphl...@googlegroups.com
Hi Tomer,
  getting the raw number of hits per marker is not currently an available option in MetaPhlAn. I do understand your point about using count data, but you should also consider that our normalization is not only with respect to the total number of hits; other normalizations are performed as well. For example, we estimate the coverage of each marker, and for this we need to normalize by the length of the marker in order to compare different markers.

In any case, I modified the current MetaPhlAn2 script, adding the "-t marker_counts" option that should be what you need. I'm attaching it to this message. If you have a chance to double check the results we may want to add it to the official repository.

cheers
Nicola
metaphlan2.py

T Alt

May 12, 2015, 3:50:46 PM
to metaphl...@googlegroups.com
Hi Nicola,

Thanks for sending this modified version. I appreciate your support!

Inspecting the code, it looks like this version of MetaPhlAn will report a table of markers and the per-marker normalized counts (but not proportions). What I really would like is the per-clade reporting of normalized counts (not proportions). That would enable me to compare species normalized counts (not proportions) across samples. Does that make sense?

Cheers,

~Tomer

gavin wang

Aug 29, 2017, 11:06:37 PM
to MetaPhlAn-users, tal...@gmail.com, nicola...@unitn.it
On Monday, May 4, 2015 at 4:48:17 AM UTC+8, T Alt wrote:
Hi Nicola,

I am confused by the description "Relative abundances are estimated weighting read counts assigned using the direct method with the total nucleotide size of all the markers in the clade, and normalizing by the sum of all directly-estimated weighted read counts" in your NM paper. How are the read counts weighted by the total nucleotide size of all the markers? Is it RPK (reads per kilobase) or something else?
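
For reference, this is what I mean by RPK, as a minimal Python sketch with made-up numbers. The function name and values are hypothetical, and whether MetaPhlAn's weighting corresponds exactly to this is the open question, not something this snippet establishes.

```python
def rpk(read_count, length_bp):
    """Reads per kilobase: read count divided by length in kilobases."""
    return read_count / (length_bp / 1000)


# Made-up clade: 45 reads mapped to markers totalling 1,500 nucleotides.
clade_reads = 45
clade_marker_bp = 1500
clade_rpk = rpk(clade_reads, clade_marker_bp)
print(clade_rpk)
```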
