Shotgun rarefactions - metagenomics (microbiome), MetaPhlAn2

Robert Kwapich

May 8, 2017, 11:00:38 AM

to MetaPhlAn-users

Hi there community!

For some time I have been working on 16S rRNA gene survey data. For this type of analysis one can use rarefaction to bring every sample to the same depth. Comparing samples of different depths is sometimes likened to searching 1 square meter of Amazon jungle and 1 square kilometer of Mojave desert and then comparing OTUs, taxa, etc. Rarefaction is relatively easy to apply, as it is implemented in many software packages, such as QIIME and mothur.

I now have a shotgun dataset: whole-genome sequencing of a microbiome. As a starting point I am using the Microbiome Helper SOP. For taxonomic assignment I use MetaPhlAn2. The MetaPhlAn2 wiki doesn't even mention rarefaction. Since this step might be crucial for comparative analyses, where I have two groups/categories of around 30 samples each, I want each sample to be as "standardized" as possible. Are there any approaches to rarefy WGS data? Is there a reason why it has not yet been implemented in, for example, MetaPhlAn2?

I'd be grateful for any insight, comments and suggestions.

Nicola Segata

May 8, 2017, 4:24:14 PM

to Robert Kwapich, MetaPhlAn-users

Hi Robert,

Thanks for getting in touch. For shotgun metagenomics you should rarefy the input metagenomes to the desired depth. For this subsampling operation you could use seqtk or other tools.
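
To make the subsampling step concrete, here is a minimal Python sketch that draws a fixed number of reads uniformly at random from a FASTQ stream using reservoir sampling. It illustrates the idea only and is not how seqtk is implemented; the function names `read_fastq` and `subsample_reads` are hypothetical, and the input is assumed to be well-formed 4-line FASTQ.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator ('+'), quality string.
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records (assumes unwrapped, well-formed input)."""
    it = iter(lines)
    # zip over the same iterator four times groups lines in blocks of four;
    # an incomplete trailing block is silently dropped.
    for header, seq, plus, qual in zip(it, it, it, it):
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def subsample_reads(records: Iterable[FastqRecord], depth: int,
                    seed: int) -> List[FastqRecord]:
    """Reservoir-sample up to `depth` records uniformly, without replacement."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < depth:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < depth:
                reservoir[j] = rec
    return reservoir
```

In this sketch, running both mate files of a paired-end sample with the same seed selects the same record positions in each file, provided both files hold the same number of reads in the same order. Samples with fewer reads than `depth` are returned whole, so they would need to be dropped or the target depth lowered.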

I hope this helps

thanks

Nicola

--
You received this message because you are subscribed to the Google Groups "MetaPhlAn-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to metaphlan-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Robert Kwapich

May 8, 2017, 4:44:03 PM

to MetaPhlAn-users, robert....@gmail.com, nicola...@unitn.it

Thanks, this was really something that I needed.

I am a beginner in this field. I have read some papers and standard operating procedures (SOPs) and watched videos from workshops, but nowhere was it mentioned that this type of analysis really needs to account for differences in depth. That seems obvious now, but I had totally neglected it.

Thanks again Nicola.
Best regards,
Robert.

Robert Kwapich

May 8, 2017, 4:51:53 PM

to MetaPhlAn-users, robert....@gmail.com, nicola...@unitn.it

Just one more question. How would you go about making rarefaction plots when my inputs are raw, decontaminated (ready to go) .fastq files that range in depth from around 1.2 million reads up to 52.5 million reads? Does it have to be so labour-intensive as to rerun, for example, the MetaPhlAn2 analysis for each sample at each rarefaction step?

Nicola Segata

May 8, 2017, 5:02:29 PM

to Robert Kwapich, MetaPhlAn-users

Yes, I see no other way than recomputing the profiles at each rarefaction step. However, I would not be surprised if the profiles do not change much between steps. Unlike 16S sequencing, metagenomics is less prone to noise due to slightly different sequencing depths. For this reason it is rare to see a metagenomic study with more than 2 or 3 rarefaction steps...
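
Since every rarefaction step means one more profiling run per sample, it can help to enumerate the runs up front. The following is a minimal sketch that lists the (sample, depth) pairs to compute and skips depths a sample cannot reach. The sample names, the three depths, and the function `rarefaction_jobs` are hypothetical; only the read counts echo the range mentioned in this thread.

```python
from typing import Dict, List, Tuple


def rarefaction_jobs(read_counts: Dict[str, int],
                     depths: List[int]) -> List[Tuple[str, int]]:
    """Return (sample, depth) pairs, skipping depths above a sample's read count."""
    jobs = []
    for sample, n_reads in sorted(read_counts.items()):
        for depth in sorted(depths):
            if depth <= n_reads:
                jobs.append((sample, depth))
    return jobs


# Hypothetical samples at the shallow and deep ends of the dataset.
read_counts = {"sample_low": 1_200_000, "sample_high": 52_500_000}
# Hypothetical rarefaction steps; two or three steps is typical per the reply above.
depths = [500_000, 1_000_000, 5_000_000]

jobs = rarefaction_jobs(read_counts, depths)
for sample, depth in jobs:
    print(f"subsample {sample} to {depth} reads, then profile it")
```

Each listed pair corresponds to one subsampling run followed by one profiling run, so the cost grows with samples times steps.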

Morgan Langille

May 9, 2017, 3:07:18 PM

to Nicola Segata, Robert Kwapich, MetaPhlAn-users

I appreciate the concern about unequal sampling depth and agree with Nicola that rarefaction is one possible way to remove this bias. However, I did want to comment that this practice is currently not routine in the research field. In fact, of the many metagenomic projects/papers that I have come in contact with, I have not seen anyone normalize their samples by rarefying to the same sequencing depth. Traditionally (which does not mean it is correct), people tend to simply "normalize" by converting their counts to relative abundances. After that, people will apply other normalization methods that are generally applicable (sqrt), account for genome size differences (HUMAnN and others), and/or treat the data as compositional (e.g. MUSiCC, etc.). Alternatively, DESeq2 is often applied to correct for sequencing depth differences without having to throw out data. A Google search for "metagenomic sample normalization" turns up lots of related articles, including an entire MSc thesis (http://publications.lib.chalmers.se/records/fulltext/245898/245898.pdf), but I'm not sure if anything is conclusive at this point.

I think a practical comparison of the effect of these different approaches on actual biological conclusions would be warranted, but the results are likely to differ depending on the annotation system used (MetaPhlAn, HUMAnN, Kraken, etc.).

Again, I'm not saying subsampling/rarefaction is incorrect, but rather that the field itself does not always do it this way. Hope this helps and I would be interested to hear what others have done with their data.
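
As an illustration of the "traditional" normalization described above, here is a minimal Python sketch that converts counts to relative abundances and then applies a square-root transform. The function names and the toy counts are hypothetical, and the sketch does not cover genome-size correction, compositional methods, or DESeq2.

```python
import math
from typing import Dict


def relative_abundance(counts: Dict[str, float]) -> Dict[str, float]:
    """Divide each count by the sample total so the values sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("sample has no counts to normalize")
    return {taxon: c / total for taxon, c in counts.items()}


def sqrt_transform(rel: Dict[str, float]) -> Dict[str, float]:
    """Square-root transform, which damps the influence of dominant taxa."""
    return {taxon: math.sqrt(v) for taxon, v in rel.items()}


# Toy counts for one sample (hypothetical values).
counts = {"A": 50, "B": 30, "C": 20}
rel = relative_abundance(counts)
transformed = sqrt_transform(rel)
```

Note that no reads are discarded here, unlike rarefaction, but the result is still compositional: each value only has meaning relative to the others in the same sample.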

-------------------------------------------

Dr. Morgan Langille

Canada Research Chair in Human Microbiomics

Assistant Professor

Department of Pharmacology, Dalhousie University

http://morganlangille.com

Director

Integrated Microbiome Resource (IMR)

http://cgeb-imr.ca

Robert Kwapich

May 9, 2017, 3:59:04 PM

to MetaPhlAn-users, nicola...@unitn.it, robert....@gmail.com

Will relative abundances not cause additional, unnoticed problems?

Let's say there is a particular bug (a species or genus) that is overly abundant in some disease state. If we stick to absolute abundances, we would perhaps observe more sequences assigned to bug A, while the others (say B, C, D) might stay unaffected, i.e. keep the same absolute abundances. This situation would show that the disease is correlated with an overabundance of bug A.

Now consider relative abundances: an increased abundance of bug A would inevitably decrease the relative abundances of bugs B, C and D. Would we not draw false conclusions about correlations?
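
A small worked example of this effect, as a sketch with made-up numbers: only bug A changes in absolute terms, yet the relative abundances of B, C and D all drop. The absolute values and the function `to_relative` are hypothetical.

```python
from typing import Dict


def to_relative(absolute: Dict[str, float]) -> Dict[str, float]:
    """Convert absolute abundances to fractions of the sample total."""
    total = sum(absolute.values())
    return {bug: n / total for bug, n in absolute.items()}


# Hypothetical absolute abundances: only bug A differs between states.
healthy = {"A": 100, "B": 100, "C": 100, "D": 100}
disease = {"A": 700, "B": 100, "C": 100, "D": 100}

rel_healthy = to_relative(healthy)
rel_disease = to_relative(disease)

for bug in healthy:
    print(bug, rel_healthy[bug], rel_disease[bug])
```

Here B, C and D fall from 0.25 to 0.1 of the community although their absolute abundances are unchanged, while A rises from 0.25 to 0.7.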

This thought comes from Dan Knights (see the first 3 minutes of this video: https://www.youtube.com/watch?v=X60nFYpLWRs).

What are your thoughts on this?

Morgan Langille

May 9, 2017, 6:58:14 PM

to Robert Kwapich, MetaPhlAn-users

Yes, this is a problem with compositional data (e.g. relative abundances), but it is a problem whether the data are rarefied or divided by the total read count, since the data can only be interpreted as relative abundances whatever normalization method is applied. You can't get absolute values from the data.

Morgan

Robert Kwapich

May 10, 2017, 11:45:06 AM

to MetaPhlAn-users, robert....@gmail.com

But what if I rarefy input sequences before taxonomic assignment to a particular number of reads, say 10M, and then obtain the number of reads that were assigned to a particular bug? What might be wrong with this approach? Assuming a fixed depth we could then see whether we've observed more of bug A compared between samples.

Morgan Langille

May 10, 2017, 12:11:32 PM

to Robert Kwapich, MetaPhlAn-users

That approach is fine (although some statisticians would argue that throwing away data is not the correct approach: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531).

"Assuming a fixed depth we could then see whether we've observed more of bug A compared between samples." You could tell if there is more of bug A in comparison to bug B between the samples, but you would not know if there is more or less abundance of bug A (e.g. physical cell count) between samples.
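
The point can be sketched numerically. Under the simplifying assumption that every cell is equally likely to contribute a read (no genome-size or extraction bias), the expected read count of a taxon at a fixed depth depends only on its proportion. The cell counts and the function `expected_reads` below are hypothetical.

```python
from typing import Dict


def expected_reads(cells: Dict[str, float], depth: int) -> Dict[str, float]:
    """Expected reads per taxon at a fixed sequencing depth.

    Simplifying assumption: each cell contributes reads with equal probability.
    """
    total = sum(cells.values())
    return {taxon: depth * n / total for taxon, n in cells.items()}


# Hypothetical physical cell counts: sample_2 holds ten times more of everything.
sample_1 = {"A": 1_000_000, "B": 1_000_000}
sample_2 = {"A": 10_000_000, "B": 10_000_000}
depth = 10_000_000

reads_1 = expected_reads(sample_1, depth)
reads_2 = expected_reads(sample_2, depth)
```

Both samples yield the same expected reads for bug A at the fixed depth, even though sample_2 contains ten times as many cells of bug A, so the tenfold difference in cell count is invisible in the read counts.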

-------------------------------------------

Dr. Morgan Langille

Canada Research Chair in Human Microbiomics

Assistant Professor

Department of Pharmacology, Dalhousie University

http://morganlangille.com

Director

Integrated Microbiome Resource (IMR)

http://cgeb-imr.ca

Robert Kwapich

Jun 2, 2017, 1:02:04 PM

to MetaPhlAn-users, robert....@gmail.com

Hi there again!

I was thinking of using generalized linear models (GLMs) to infer statistically significant differences between two study groups while also controlling for potential confounding factors. I know there is MaAsLin (Multivariate Association with Linear Models), developed by the Huttenhower lab.

My problem is that sequencing depth varies across my samples: from 1.68M to 52.2M reads. Computing ABSOLUTE abundances and then using generalized linear models that account for differences in the number of observations would also allow me to control for confounding factors from the metadata file.

Is this a good idea? I can't find a way to compute absolute abundances with MetaPhlAn2. What do you think about obtaining a BASE that is scaled to the number of reads in each sample? I.e. when a certain bacterial phylum represents 40% of a sample, I could compute the BASE as 40% of the number of reads for that sample.
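
The proposed BASE calculation would look roughly like the sketch below. This only spells out the arithmetic of the idea and is not a MetaPhlAn2 feature; the function `pseudo_counts` and the phylum percentages are hypothetical. As far as I understand, MetaPhlAn2 percentages are estimated from marker-gene coverage rather than from the fraction of all reads, so the results are scaled pseudo-counts, not true read counts.

```python
from typing import Dict


def pseudo_counts(rel_abundance_percent: Dict[str, float],
                  n_reads: int) -> Dict[str, int]:
    """Scale percentage abundances by the sample's total read count."""
    return {taxon: round(pct / 100.0 * n_reads)
            for taxon, pct in rel_abundance_percent.items()}


# Hypothetical phylum-level profile for the shallowest sample (1.68M reads).
profile = {"p__Firmicutes": 40.0, "p__Bacteroidetes": 60.0}
base = pseudo_counts(profile, n_reads=1_680_000)
print(base)
```

With these numbers the 40% phylum gets a BASE of 672,000. Whether such pseudo-counts satisfy the count-model assumptions of a GLM is exactly the open question in this post.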

The GLM is from the edgeR R package (Bioconductor); a nice wrapper written by Dan Knights is here: https://github.com/danknights/mice8992-2016/blob/master/src/wrap.edgeR.r

What do you think of this approach?

Does MaAsLin assume a normal distribution of taxa? Or a normal distribution of residuals?

Best,
Robert.
