Bin heterogeneity - SNV analysis

58 views

Skip to first unread message

Meghan Chafee

unread,

Dec 9, 2016, 9:13:49 AM12/9/16

to Anvi'o

Most of my samples contain low and often sporadic coverage across a contig. I am not interested in the SNVs in samples with a low portion of contig coverage. It looks like you can filter our low coverage positions (in all samples), but I want to filter out samples with a low portion of coverage across a bin of contigs. Coverage can be high over small regions so a minimum positional coverage would allow some SNVs to remain despite that most of the contig is not covered at all. There is a table within the PROFILE.db that gives me this information, but it appears that portion_covered_contigs and portion_covered_splits are both the same split-level files? Is this a mistake? I was hoping to use portion_covered at the contig level.

I ultimately want to take an averaged measure of heterogeneity (departure_from_consensus) from each bin. I would then dig into sample-level heterogeneity to see what happens over time. For bins that appear for brief windows of time, this won’t be very fruitful but for bins that are present across many samples, it could be interesting regarding selective sweeps, etc. What do I need to do to normalize such a value to account for all of the positions that are not variable within these contigs (normalize by total contig length?).

I haven’t yet examined the correlation between depth and SNV frequency but a colleague of mine said he sees a correlation. Do you think this is an issue in the analysis I want to do? What should I do to ensure that SNV frequencies across positions of different depths are comparable? Maybe keep only those that have similar coverage values?

Because anvio can’t yet handle our full dataset, we have decided to generate bins from our entire dataset with metabat and then select a subset of bins to be imported into anvio. We look at things like checkm completeness, coverage profiles and taxonomy to decide which bins are of interest. I then generate a new anvio database with all of the relevant contigs, bam files, taxonomy, gene-calls and collection data for my selected bins. We have noticed (warning N=2) a larger discrepancy between checkM and anvio completeness scores within a bin that contains many smaller contigs. Since anvio does not translate partial genes, I assume that partial genes are then removed from the HMM search for SCGs. Is this correct? This could explain why a bin with long contigs show more congruency between the completeness measures in checkM versus anvio.

Cheers,

Meghan

A. Murat Eren

unread,

Dec 9, 2016, 10:27:08 AM12/9/16

to an...@googlegroups.com

Dear Meghan,

Thank you for all the great questions ;)

On Fri, Dec 9, 2016 at 9:13 AM, Meghan Chafee <meghan...@gmail.com> wrote:

I want to filter out samples with a low portion of coverage across a bin of contigs.

Yes, it does

make sense

. You can

use portion-coverage estimates to identify samples with very low / patch coverage, and then use the `--samples-of-interest` parameter to not report any SNV information from them. Additionally

you can use the outlier column in the SNV profile to say "I will only use positions that are not outliers at the contig-level". If a contig is not well-detected, but have multiple peaks due to non-specific mapping, all those positions will end up being marked as outliers

in the output.

It is a legitimate concern to not include SNVs from patchy samples.

There is a table within the PROFILE.db that gives me this information, but it appears that portion_covered_contigs and portion_covered_splits are both the same split-level files? Is this a mistake? I was hoping to use portion_covered at the contig level.

This is an anvi'o trick. Both tables are valid

, but slightly different:

the table

portion_covered_contigs, values for each split represents the contig averages, and we use it for clustering operations to make sure splits are connected to each other

. In

the table

portion_covered_splits, values are truly split averages, and this table is used for visualization purposes

so splits originating from the same contig stay together during clustering, but you see finer scale differences between them when you visualize them.

I ultimately want to take an averaged measure of heterogeneity (departure_from_consensus) from each bin. I would then dig into sample-level heterogeneity to see what happens over time. For bins that appear for brief windows of time, this won’t be very fruitful but for bins that are present across many samples, it could be interesting regarding selective sweeps, etc.

Yes, we completely agree, and we have a summary statistic,

"SNV density" (number of SNVs per kbp for a given genome bin)

, to estimate the sample level heterogeneity of a given genome bin

What do I need to do to normalize such a value to account for all of the positions that are not variable within these contigs (normalize by total contig length?).

The SNV density is normalized by bin length, already. If you do your own filtering from the SNV profiles output, I suggest you also use a similar normalization that relies on length, so increases and decreases of SNV density values translate to some intuitive understanding.

I haven’t yet examined the correlation between depth and SNV frequency but a colleague of mine said he sees a correlation. Do you think this is an issue in the analysis I want to do? What should I do to ensure that SNV frequencies across positions of different depths are comparable? Maybe keep only those that have similar coverage values?

There are *many* factors that affect the emergence of SNVs and their frequencies in mapping results. The relationship between coverage and frequencies could be due to blooming events, for instance: coverage goes high

as a whole

but

average frequency ratios of competing nucleotides (n2n1ratio)

, which is our proxy to strains contributing to this context,

goes down, because let's say one fraction of the population is getting crazy with their expansion. Or it could be due to non-specific mapping because of multi-copy genes or gene conservancy

across members of a given clade

: you have a region that is covered five times more than the upstream and downstream, and you recover much more SNVs from that part with higher n2n1ratio. We se

both

phenomena in our data, and more

. There is no such thing as a "correlation" that will mean the same thing for every environment, or every genome bin in a single environment, or every gene within a genome bin, or every nucleotide position in a

single

gene. Both technical limitations that affect our sensitivity and specificity, as well as the complex inner-workings of natural environments will make sure that it is the case.

t takes real effort to tease these apart

, and there are many questions to answer

. I am not very fond of grand conclusions when things are so very intricate, and we have a poor understanding regarding technical and biological reasons behind what we are seeing. So you need to dig in

, a

nd let us know

The sole purpose of anvi'o is to help you do that, and we can improve it if it is lacking something that may be useful.

Because anvio can’t yet handle our full dataset, we have decided to generate bins from our entire dataset with metabat and then select a subset of bins to be imported into anvio.

The best practice would be to

importing everything

and

use

anvi-refine to go through

bins.

But profiling

many many very large

BAM files

is currently is an issue with anvi'o. We don't take this lightly and we are appropriately embarrassed about it, b

ut we will fix that soon. Ozcan on his way to the US, and

one of

his priorit

ies

is to redesign the profiling

step

. We will see how much improvement we will get.

We look at things like checkm completeness, coverage profiles and taxonomy to decide which bins are of interest. I then generate a new anvio database with all of the relevant contigs, bam files, taxonomy, gene-calls and collection data for my selected bins.

This may not be the best way

(especially to study SNVs)

, because probably you increase the amount of non-specific mapping

when you generate new BAM files by mapping everything in your metagenome to a subset of your contigs

. Do you know anvi-profile has a parameter called `--contigs-of-interest`? So you can use the original BAM files to subset for those contigs of interest coming from a target bin. This way, you wouldn't have to redo the mapping, and rely on mapping results that considered your entire contig collection. I would strongly suggest

everyone who is in a similar situation

to look into this option, because this may have a huge impact on your SNV analyses. I would also be very happy to see a comparison for future references :) I.e., you take contigs of a bin and map metagenomic data to create a BAM file to profile with anvi'o (which you already did), and using the --contigs-of-interest parameter to fish for only those contigs from your original BAM file, and look at how things differ between the two.

We have noticed (warning N=2) a larger discrepancy between checkM and anvio completeness scores within a bin that contains many smaller contigs. Since anvio does not translate partial genes, I assume that partial genes are then removed from the HMM search for SCGs. Is this correct? This could explain why a bin with long contigs show more congruency between the completeness measures in checkM versus anvio.

We do not see a remarkable discrepancy between CheckM and anvi'o in our analyses, maybe we will do a more systematical analysis here to find out about those situations and better understand why thy occur. We regularly use CheckM and like it a lot, so it will be easy.

But If prodigal identifies them (whether they are partial or not), the AA sequences are stored in the database, and HMMs are run on these genes for domain SCGs. So I am not sure whether partial genes could be an explanation to what you are seeing, but hopefully there will be more soon.

Thanks again,

Best.

Reply all

Reply to author

Forward

0 new messages