Most of my samples have low and often sporadic coverage across a contig. I am not interested in the SNVs from samples in which only a small portion of the contig is covered. It looks like you can filter out low-coverage positions (in all samples), but I want to filter out samples with a low portion of coverage across a bin of contigs. Coverage can be high over small regions, so a minimum positional coverage would let some SNVs remain even though most of the contig is not covered at all. There is a table within the PROFILE.db that gives me this information, but it appears that portion_covered_contigs and portion_covered_splits both contain the same split-level data. Is this a mistake? I was hoping to use portion_covered at the contig level.
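To make the filter I have in mind concrete, here is a minimal sketch in plain Python. It assumes I have already pulled the split lengths and the per-sample portion_covered values for one bin out of the database myself; the function names, the dictionary layout and the 0.5 threshold are all hypothetical and not part of any existing program.

```python
def bin_portion_covered(split_lengths, split_portion_covered):
    """Length-weighted portion of a bin covered in one sample.

    split_lengths: {split_name: length in nt}
    split_portion_covered: {split_name: fraction of the split covered (0..1)}
    """
    total = sum(split_lengths.values())
    covered = sum(
        split_lengths[name] * split_portion_covered[name]
        for name in split_lengths
    )
    return covered / total


def samples_passing(split_lengths, portion_by_sample, min_portion=0.5):
    """Names of samples whose bin-wide portion covered is at least min_portion.

    portion_by_sample: {sample_name: {split_name: fraction covered}}
    min_portion is an arbitrary example threshold, not a recommendation.
    """
    return sorted(
        sample
        for sample, portions in portion_by_sample.items()
        if bin_portion_covered(split_lengths, portions) >= min_portion
    )
```

Weighting by split length is a deliberate choice here: a short, fully covered split should not hide the fact that a long split has no coverage at all.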
I ultimately want to take an averaged measure of heterogeneity (departure_from_consensus) from each bin. I would then dig into sample-level heterogeneity to see what happens over time. For bins that appear only for brief windows of time, this won’t be very fruitful, but for bins that are present across many samples, it could be interesting regarding selective sweeps, etc. How should I normalize such a value to account for all of the positions within these contigs that are not variable? Should I normalize by total contig length?
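Here is a rough sketch of the normalization I am asking about, assuming that positions with no reported SNV can be treated as having a departure_from_consensus of zero. The function name and inputs are hypothetical; whether the denominator should be the total length of the bin or only the positions that are actually covered in that sample is exactly the part I am unsure about.

```python
def mean_departure_per_position(departures, n_positions):
    """Average departure from consensus over all positions considered.

    departures: departure_from_consensus values of the variable positions
                reported for one bin in one sample
    n_positions: denominator, e.g. total bin length in nt or the number of
                 covered positions (an assumption, see above)
    Positions without a reported SNV contribute zero to the sum.
    """
    if n_positions <= 0:
        raise ValueError("n_positions must be positive")
    return sum(departures) / n_positions
```

For example, two variable positions with departures of 0.2 and 0.4 in a 1,000 nt bin would give a value of about 0.0006, whereas averaging over the two variable positions alone would give 0.3.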
I haven’t yet examined the correlation between depth and SNV frequency, but a colleague of mine said he sees one. Do you think this is an issue for the analysis I want to do? What should I do to ensure that SNV frequencies across positions of different depths are comparable? Maybe keep only those positions that have similar coverage values?
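As a minimal sketch of both ideas, assuming I have a list of (coverage, departure_from_consensus) pairs for the variable positions in one sample: the first function is a plain Pearson correlation to check whether depth and SNV frequency are related, and the second keeps only positions whose coverage lies within a band around the median. The names and the 0.5 tolerance are hypothetical choices of mine.

```python
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def within_coverage_band(positions, tolerance=0.5):
    """Keep (coverage, departure) pairs with coverage near the median.

    A position is kept if its coverage lies within
    median * (1 - tolerance) .. median * (1 + tolerance).
    """
    med = median(cov for cov, _ in positions)
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    return [(cov, dep) for cov, dep in positions if low <= cov <= high]
```

The band filter throws data away, so it only makes sense if the correlation check suggests that depth really does drive the SNV frequencies.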
Because anvi’o can’t yet handle our full dataset, we have decided to generate bins from our entire dataset with MetaBAT and then select a subset of bins to import into anvi’o. We look at things like CheckM completeness, coverage profiles and taxonomy to decide which bins are of interest. I then generate a new anvi’o database with all of the relevant contigs, BAM files, taxonomy, gene calls and collection data for my selected bins. We have noticed (warning: N=2) a larger discrepancy between CheckM and anvi’o completeness scores for a bin that contains many smaller contigs. Since anvi’o does not translate partial genes, I assume that partial genes are then excluded from the HMM search for SCGs. Is this correct? This could explain why a bin with long contigs shows more congruence between the CheckM and anvi’o completeness measures.
Cheers,
Meghan