Comparing groups of samples; group-based profiles(?)


Robert Kwapich

Sep 11, 2017, 8:59:37 PM
to Anvi'o
Hi Community! I've been trying to follow several tutorials from Meren's lab web page, adapted to my own purposes. I followed the basic "metagenomic workflow" tutorial and, more recently, "Recovering HBDs from TARA Oceans Metagenomes". Essentially I'm trying to compare two groups of human gut metagenomes, (1) and (2), let's say control and a disease case, with around 30 samples in EACH group. Following the initial tutorial, I co-assembled the entire dataset (all reads), created a contigs.db and annotated it with taxonomy and some functional annotations, then profiled each sample separately (creating BAM files and later profile *.db files), up to the point where I have merged profiles (on a per-sample basis).

But I see that in the TARA Oceans workflow the authors compare groups of samples (i.e. different metagenomes), and I would rather do that, since I care more about groups than individual samples. My scientific question is very general: how do these two groups differ from each other? Does the case group lack certain MAGs? Are certain MAGs detected only in the case group? I was inspired by Meren's paper on fecal transplants, but his number of samples was smaller (10) compared to my 70. My question: is there a way, after all these computations, to "collapse" samples into groups? If not, am I thinking correctly that I should take the raw reads, concatenate them into the desired groups (1) and (2), and then map them as proposed in the basic "metagenomic workflow"?
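For reference, this is roughly the sequence of anvi'o commands I ran following the tutorial (just a sketch; the file and sample names below are placeholders for my actual ones):

    # contigs database from the co-assembly, plus HMMs
    anvi-gen-contigs-database -f co_assembly_contigs.fa -o contigs.db
    anvi-run-hmms -c contigs.db

    # one single profile per sample, all against the same contigs.db
    anvi-profile -i SAMPLE_01.bam -c contigs.db -o PROFILE_SAMPLE_01
    # ... repeated for each of the ~70 samples ...

    # merge the single profiles
    anvi-merge PROFILE_*/PROFILE.db -c contigs.db -o MERGED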

Also, displaying around 70 samples through anvi-interactive is very laggy (slow), which makes manual curation of bins through anvi-refine very frustrating. CONCOCT identified 80 bins, and among them only 1 has redundancy <10% and completion >90%; the rest need to be manually curated. I thought that collapsing samples into two categories (CASE vs CONTROL) could make visualizing differences, and manually curating bins, easier.
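(For context, this is how I have been opening individual bins for refinement; the bin name here is just an example:

    anvi-refine -p MERGED/PROFILE.db -c contigs.db -C CONCOCT -b Bin_1

and with ~70 sample layers drawn, the browser can barely keep up.)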

Since I already have an annotated contigs.db, I was thinking of concatenating the RAW reads (after QC) into CASE and CONTROL sets, creating BAM files from those, and then profiles (*.db). What are your comments on this?
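Concretely, I was imagining something along these lines (a rough sketch only; file names are placeholders, and I'm assuming bowtie2 and samtools as in the metagenomic workflow tutorial, with gzipped fastq reads after QC):

    # pool QC'd reads per group
    cat case_*_R1.fastq.gz > CASE_R1.fastq.gz
    cat case_*_R2.fastq.gz > CASE_R2.fastq.gz

    # map the pooled reads against the existing co-assembly
    bowtie2-build co_assembly_contigs.fa co_assembly
    bowtie2 -x co_assembly -1 CASE_R1.fastq.gz -2 CASE_R2.fastq.gz --threads 8 -S CASE.sam
    samtools view -bS CASE.sam | samtools sort -o CASE.bam
    samtools index CASE.bam

    # profile the pooled mapping (same for CONTROL), then merge the two profiles
    anvi-profile -i CASE.bam -c contigs.db -o PROFILE_CASE -S CASE
    anvi-merge PROFILE_CASE/PROFILE.db PROFILE_CONTROL/PROFILE.db -c contigs.db -o MERGED_GROUPS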

Best regards,
Robert.

Robert Kwapich

Sep 13, 2017, 5:38:07 PM
to Anvi'o
Put more simply,

I am terrified when I look at my "Bin_1" with around 70 samples displayed. Although I run everything on the server, I still have to render the plot in my browser, and, well... it is a resource killer.



Instead of ~70 samples drawn as concentric circles, perhaps merging the samples into two (biologically relevant) groups, Control and Case, would help? All I want to do here is quickly refine my bins.

A. Murat Eren

Sep 13, 2017, 10:00:57 PM
to Anvi'o
Hi Robert,

On Mon, Sep 11, 2017 at 7:59 PM, Robert Kwapich <robert....@gmail.com> wrote:
Essentially I'm trying to compare two groups of human gut metagenomes, (1) and (2), let's say control and a disease case, with around 30 samples in EACH group. Following the initial tutorial, I co-assembled the entire dataset

I don't think co-assembling human gut metagenomes is a good practice unless you are co-assembling time-series data from a single individual. There is just too much interpersonal variation in the gut to get meaningful co-assemblies in most cases. If you are talking about 60 samples, I would be extremely surprised if you get even a single bin from this workflow.

 
But I see that in the TARA Oceans workflow the authors compare groups of samples

Dividing metagenomes into multiple meaningful groups is helpful in most cases if you know how to collapse redundancy, as we did in the TARA Oceans work, but even this will not work for your dataset.

In the FMT work we co-assembled 4 samples collected temporally from a single individual to track population genomes in recipient guts. In fact, it wasn't even a co-assembly of 10 samples.

My question: is there a way, after all these computations, to "collapse" samples into groups?

In my opinion, your best bet is to assemble every metagenome independently. Resolving population genomes properly from there is a challenging task, and there are multiple options. First, you can map each metagenome to its own assembly, and use that to try to resolve population genomes; but this will not give you enough power to discriminate populations accurately, as you will not have the differential coverage aspect. Second, you can try to map all control or all treatment metagenomes to each individual assembly, and carefully resolve genomes from that mess (it will be a mess due to the pangenomic nature of closely related populations, which will create patchy mapping patterns that confuse binning algorithms or your manual curation steps; I think there are ways to deal with that, but it is not easy to write about them).
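A minimal sketch of what that second option could look like, just to make it concrete (assuming MEGAHIT as the assembler and bowtie2/samtools for mapping; file names and the sample lists are placeholders, not a recommendation of specific parameters):

    # assemble each metagenome independently and make a contigs database for each
    for s in $(cat sample_names.txt); do
        megahit -1 ${s}_R1.fastq.gz -2 ${s}_R2.fastq.gz -o ASSEMBLY_${s}
        anvi-script-reformat-fasta ASSEMBLY_${s}/final.contigs.fa -o ${s}-contigs.fa --simplify-names
        anvi-gen-contigs-database -f ${s}-contigs.fa -o ${s}-contigs.db
    done

    # then, e.g., recruit reads from every control metagenome against one individual's assembly
    bowtie2-build sample_01-contigs.fa sample_01_ref
    for s in $(cat control_sample_names.txt); do
        bowtie2 -x sample_01_ref -1 ${s}_R1.fastq.gz -2 ${s}_R2.fastq.gz -S ${s}.sam
        samtools view -bS ${s}.sam | samtools sort -o ${s}.bam && samtools index ${s}.bam
        anvi-profile -i ${s}.bam -c sample_01-contigs.db -o PROFILE_${s} -S ${s}
    done
    anvi-merge PROFILE_*/PROFILE.db -c sample_01-contigs.db -o MERGED_sample_01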

My 2 cents.


Best,

Robert Kwapich

Sep 14, 2017, 12:33:08 PM
to Anvi'o
Thanks for your answer, Meren,

I don't think co-assembling human gut metagenomes is a good practice unless you are co-assembling time-series data from a single individual. There is just too much interpersonal variation in the gut to get meaningful co-assemblies in most cases. If you are talking about 60 samples, I would be extremely surprised if you get even a single bin from this workflow.

So I shouldn't trust even a bin with 96% completion and 4.3% redundancy from this co-assembly?

I was, perhaps naively, thinking that co-assembly of several gut metagenome samples would produce better assemblies, since I am feeding in more data. Assuming there are low-abundance species, increasing the input size from the same cohorts could perhaps give me a chance to assemble these low-abundance bacteria.

I know that even the healthy microbiome differs from individual to individual, but, correct me if I am wrong, would co-assembly here suffer just because I used, let's say, 30 or 60 samples? Isn't the "pangenomic nature of closely related populations" also a thing for a single (deeply sequenced) sample? I.e., isn't it inherent to the problem at hand of trying to understand and (perhaps) make sense of the microbiome, the metagenome?

A. Murat Eren

Sep 14, 2017, 12:53:13 PM
to Anvi'o
Hi Robert,

On Thu, Sep 14, 2017 at 11:33 AM, Robert Kwapich <robert....@gmail.com> wrote:
So I shouldn't trust even a bin with 96% completion and 4.3% redundancy from this co-assembly?

No, you certainly should trust it; it is just that it may not happen very often.


I was, perhaps naively, thinking that co-assembly of several gut metagenome samples would produce better assemblies, since I am feeding in more data. Assuming there are low-abundance species, increasing the input size from the same cohorts could perhaps give me a chance to assemble these low-abundance bacteria.

This is a correct assumption, and it may have helped you to recover some of the "rare everywhere" populations. But on the other hand, co-assembly strategies involving multiple individuals can be challenging for abundant populations.


I know that even the healthy microbiome differs from individual to individual, but, correct me if I am wrong, would co-assembly here suffer just because I used, let's say, 30 or 60 samples? Isn't the "pangenomic nature of closely related populations" also a thing for a single (deeply sequenced) sample? I.e., isn't it inherent to the problem at hand of trying to understand and (perhaps) make sense of the microbiome, the metagenome?

It *is* a problem even for a single sample, but a single sample will certainly have less complexity than multiple samples combined when it comes to healthy gut samples.

Of course, these are all things *you* need to consider, since you have the data to recover from all the wrong assumptions :)


Best,

Les Dethlefsen

Sep 14, 2017, 1:53:32 PM
to an...@googlegroups.com
Hi Robert,

I’m also working with the human gut microbiota, and I concur with Meren: co-assembly across subjects is problematic. There’s simply too much interpersonal variability, at all levels, ranging from the proportions of major phyla to strain-level variation within the most prevalent gut species. Feeding more data into an assembly is only helpful if the additional reads cover the same genomes, and it’s far from clear that unrelated human adults share many of the same genomes. (‘Same’ here is defined operationally by the stringency of your assembly algorithm for overlapping reads; ‘same genome’ is a lot more stringent than belonging to a prevalent gut species as assessed by 16S studies looking for a ‘core human gut microbiota’.) Especially so because the ‘same’ genome would have to be not just present, but reasonably abundant, in multiple samples to be the source of many additional reads.

You could consider an approach like the original MetaHIT paper (Qin et al. 2010): do the within-subject assembly first, then merge unassembled reads across subjects for another round of attempted assembly…but I haven’t really had to think through this issue myself, and the MetaHIT paper was just trying to get contigs long enough from 75-base GA-II reads to reliably call genes, and then analyzed the gene catalog. They weren’t aiming for MAGs.
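Roughly, and only as a sketch (I haven’t tested this; assuming bowtie2, samtools, and MEGAHIT, with placeholder file names): map each subject’s reads back to that subject’s own assembly, keep the pairs where neither mate mapped, pool those leftovers across subjects, and attempt a second assembly on the pool.

    # per subject: map reads back to that subject's own assembly
    bowtie2-build ASSEMBLY_subject01/final.contigs.fa subject01_ref
    bowtie2 -x subject01_ref -1 subject01_R1.fastq.gz -2 subject01_R2.fastq.gz -S subject01.sam

    # keep pairs where neither mate mapped (-f 12 = read unmapped AND mate unmapped)
    samtools view -b -f 12 subject01.sam | samtools sort -n -o subject01_unmapped.bam
    samtools fastq -1 subject01_unmapped_R1.fastq -2 subject01_unmapped_R2.fastq subject01_unmapped.bam

    # pool the leftovers across subjects and try a second round of assembly
    cat subject*_unmapped_R1.fastq > pooled_R1.fastq
    cat subject*_unmapped_R2.fastq > pooled_R2.fastq
    megahit -1 pooled_R1.fastq -2 pooled_R2.fastq -o ASSEMBLY_pooled_leftovers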

Depending on the depth of your sequencing per sample (per subject), you might not have much success getting MAGs for each subject.  I remember from the first Banfield lab forays into assembly-based metagenomics of complex samples (a contaminated aquifer in their case), they were shooting for ~30 Gb of sequencing from the community to be able to get partial MAGs from dozens of microbial strains.  Think about that ratio: roughly a factor of 1000 between raw sequencing depth within one complex community and the summed length of things large enough to be reasonably called MAGs, as opposed to just contigs (by that ratio, ~30 Gb of raw sequence corresponds to something on the order of 30 Mb of MAG-grade assembly).

If you don’t have (and can’t get) the depth to assemble much within your individual subjects, you might consider trying to map your reads to the updated human gut gene catalog.  Li et al. 2014 (Nature Biotechnology) is a direct descendant of the original MetaHIT catalog, and would likely be both easier to work with and more appropriate than a general reference database such as UniRef or the NCBI genomes.
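In outline (again just a sketch; the catalog file name below is a placeholder for whatever the actual download is called, and I’m assuming bowtie2 plus samtools idxstats for crude per-gene read counts):

    # index the gene catalog nucleotide sequences (placeholder file name)
    bowtie2-build IGC_gene_catalog.fa IGC

    # map one sample and count reads per catalog gene
    bowtie2 -x IGC -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz --no-unal -S sample_vs_IGC.sam
    samtools view -bS sample_vs_IGC.sam | samtools sort -o sample_vs_IGC.bam
    samtools index sample_vs_IGC.bam
    samtools idxstats sample_vs_IGC.bam > sample_gene_counts.tsv  # columns: gene, length, mapped, unmapped

From there you would want to normalize by gene length and sequencing depth before comparing your two groups.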


Les Dethlefsen
Relman Lab
Stanford University
deth...@stanford.edu

