A handy summary of the responses ....
[Paul] [Also pointed to Lindsey]
> I avoid pi_s because the value can vary a lot depending on the rules you use for counting synonymous sites (that denominator that you refer to).
Yes, although it's less bad if you don't have extreme composition skews or CUB
[Andrew] [Also pointed to Simon]
> As far as I'm aware, yes. I have a script for taking a reference fasta and gff file and parsing codon degeneracy per gene. I'm happy to share, but it's written in R
> A related frustration is that, as you point out, With SnpEff it's easy to sort a vcf into syn and non-syn variants
> SnpEff HAS TO be doing labelling codon degeneracy under the hood to make the syn/non-syn calls, but there's no way to pull out that intermediate information.
I think these are separate problems. It's relatively easy to say if a SNP is synonymous or nonsynonymous, given the ancestral state of the sequence it arose on. But, as Paul points out, there is a veritable zoo of ways of deciding the effective number of syn/nonsyn 'sites', depending on what mutational or compositional biases you want to account for! But there are many standard ones in the literature
[Simon]
> I have my own pipeline that does this from VCF without using any packages (apart from numpy).
> I believe Dom wrote his own too, so if you write one we will almost have four-wheel drive!
> One thing to think about is whether your definition of a 4-fold degenerate site also considers whether there are variants in your dataset at positions 1 and 2 in the codon.
Yes, and this also applies to Syn/Nonsyn 'site' counting. Although, unless diversity is very high, the quantitative impact for within population diversity is probably minimal
> Not sure if SnpEff does that?
SnpEff only classifies variants (syn / nonsyn). It does not take any of the harder decisions about effective 'sites'
[Billy]
> I’m somehow still on this…
How are you still on this list?!
> I use popgenome in R for everything. A bit of a learning curve but it can do it all.
Does popgenome actually take a stab at the number of sites for you? What method does it use?
[Lindsey]
> Usually I work with individual gene alignments, so the most relevant wheel to your question is a perl script that parses an embl file of a
> reference to get a bed file of fourfold degenerate sites.
> It was pretty simple and I remember Dom had some objection to it just looking at a single reference sequence, though I can't remember the details.
[Ben]
> My 2c is that I find it helpful to construct SFSes per gene (or per gene's introns) and apply functions to those to calculate pop gen summary statistics.
> Sum the SFSes for point estimates and bootstrap them for CIs (assuming recombination).
> You're welcome to the sumstat-from-sfs code that I have in R and/or python if it's helpful. They are an amalgam of Rob Ness', Tom Booker's and my own work...
All reasonable, although it doesn't say whether you venture into counting effective numbers of synonymous 'sites', or which sequence you use to choose whether a site is four-fold!
[Deborah]
> Not sure why you ask whether your definition of a 4-fold degenerate site also considers whether there are variants in your dataset at positions 1 and 2 in the codon
> — surely 4x sites are all in 3rd positions of codons, aren’t they?
[Simon]
~ How many 4fold sites are there in a codon that is segregating for CTC, CTT and TTT?
> In the past I have only considered 4D sites in codons where the first two positions are invariant across the entire dataset
I have done this, but ...I would be slightly concerned that this introduces biases in the case of divergent balanced haplotypes. Although that is unlikely to be a problem at the genome-average scale
> (including outgroup species when available). What do you think?
I have done this, but ...I would be somewhat /more/ worried that this introduces biases for balanced or other very strongly selected haplotypes.
BUT then I would also be worried about using a systematically different set of sites for K and pi
[Konrad]
> Four wheels?? More likely a centipede.
Centipedes don't have wheels
> This script has been written so many times.
> Just in the IEB postdocs there are versions from: DanHalligan, RobNess, BeateNurnberger, Lindsey PlenderleithDom, Jose Campos, Dom Laetsch...
Yes, and obviously I also have versions of my own, either slow in R or ...
... Since SnpEff is reasonably competent to recognize whether a mutation is syn or nonsyn, and since vcftools is competent to calculate diversity at variant sites provided to it in a vcf, the challenge in my mind has been to choose a preferred estimator of the number of effective sites in each class. Its annoying that SNPeff and/or VCF tools does not implement some standard approach!
In the (distant) past I have used the amusingly venerable
https://www.hiv.lanl.gov/content/sequence/SNAP/help_files/README.html (also
https://anaconda.org/bioconda/perl-snap ) which counts sites according to Nei-Gojobori. This is not so bad if base-composition at 3rd positions is not too skewed (etc) and divergence is not high (not a problem for within-population data!).
I have also used PAML (inc. pairwise) to get ML estimates, but this depends on a codon model in which you need to make more subtle decisions about base composition and substitution models, and what these imply for the mutational-opportunity count of sites (Paul's point). And you need to create interim trees and alignments, which is a bit silly in the age of the VCF!
My reason for asking is that
(1) It would be convenient to start with a VCF and a GFF/GTF
(2) For 4fold and 0fold I've never decided what it is best to do with codons that are segregating for more than two positions (Simon's problem)
Thanks all!
D
--
Darren Obbard
darren...@ed.ac.uk
Institute of Evolutionary Biology
University of Edinburgh
Ashworth Laboratories, Charlotte Auerbach Road
Edinburgh EH9 3FL
Office 0131 651 7781
Mobile: 07968 838 635
http://obbard.bio.ed.ac.uk/
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.