Re: denominators for pi_s and pi_4fold - Answer summary

39 views

Skip to first unread message

OBBARD Darren

unread,

Mar 14, 2022, 6:04:15 PM3/14/22

to evo...@lists.ed.ac.uk, ashworth-c...@googlegroups.com, WANG Yiguan

A handy summary of the responses ....

[Paul] [Also pointed to Lindsey]
> I avoid pi_s because the value can vary a lot depending on the rules you use for counting synonymous sites (that denominator that you refer to).

Yes, although it's less bad if you don't have extreme composition skews or CUB

[Andrew] [Also pointed to Simon]
> As far as I'm aware, yes. I have a script for taking a reference fasta and gff file and parsing codon degeneracy per gene. I'm happy to share, but it's written in R
> A related frustration is that, as you point out, With SnpEff it's easy to sort a vcf into syn and non-syn variants
> SnpEff HAS TO be doing labelling codon degeneracy under the hood to make the syn/non-syn calls, but there's no way to pull out that intermediate information.

I think these are separate problems. It's relatively easy to say if a SNP is synonymous or nonsynonymous, given the ancestral state of the sequence it arose on. But, as Paul points out, there is a veritable zoo of ways of deciding the effective number of syn/nonsyn 'sites', depending on what mutational or compositional biases you want to account for! But there are many standard ones in the literature

[Simon]
> I have my own pipeline that does this from VCF without using any packages (apart from numpy).
> I believe Dom wrote his own too, so if you write one we will almost have four-wheel drive!
> One thing to think about is whether your definition of a 4-fold degenerate site also considers whether there are variants in your dataset at positions 1 and 2 in the codon.

Yes, and this also applies to Syn/Nonsyn 'site' counting. Although, unless diversity is very high, the quantitative impact for within population diversity is probably minimal

> Not sure if SnpEff does that?

SnpEff only classifies variants (syn / nonsyn). It does not take any of the harder decisions about effective 'sites'

[Billy]
> I’m somehow still on this…

How are you still on this list?!

> I use popgenome in R for everything. A bit of a learning curve but it can do it all.

Does popgenome actually take a stab at the number of sites for you? What method does it use?

[Lindsey]
> Usually I work with individual gene alignments, so the most relevant wheel to your question is a perl script that parses an embl file of a
> reference to get a bed file of fourfold degenerate sites.
> It was pretty simple and I remember Dom had some objection to it just looking at a single reference sequence, though I can't remember the details.

[Ben]
> My 2c is that I find it helpful to construct SFSes per gene (or per gene's introns) and apply functions to those to calculate pop gen summary statistics.
> Sum the SFSes for point estimates and bootstrap them for CIs (assuming recombination).
> You're welcome to the sumstat-from-sfs code that I have in R and/or python if it's helpful. They are an amalgam of Rob Ness', Tom Booker's and my own work...

All reasonable, although it doesn't say whether you venture into counting effective numbers of synonymous 'sites', or which sequence you use to choose whether a site is four-fold!

[Deborah]
> Not sure why you ask whether your definition of a 4-fold degenerate site also considers whether there are variants in your dataset at positions 1 and 2 in the codon
> — surely 4x sites are all in 3rd positions of codons, aren’t they?

[Simon]
~ How many 4fold sites are there in a codon that is segregating for CTC, CTT and TTT?

> In the past I have only considered 4D sites in codons where the first two positions are invariant across the entire dataset

I have done this, but ...I would be slightly concerned that this introduces biases in the case of divergent balanced haplotypes. Although that is unlikely to be a problem at the genome-average scale

> (including outgroup species when available). What do you think?

I have done this, but ...I would be somewhat /more/ worried that this introduces biases for balanced or other very strongly selected haplotypes.

BUT then I would also be worried about using a systematically different set of sites for K and pi

[Konrad]
> Four wheels?? More likely a centipede.

Centipedes don't have wheels

> This script has been written so many times.
> Just in the IEB postdocs there are versions from: DanHalligan, RobNess, BeateNurnberger, Lindsey PlenderleithDom, Jose Campos, Dom Laetsch...

Yes, and obviously I also have versions of my own, either slow in R or ...

... Since SnpEff is reasonably competent to recognize whether a mutation is syn or nonsyn, and since vcftools is competent to calculate diversity at variant sites provided to it in a vcf, the challenge in my mind has been to choose a preferred estimator of the number of effective sites in each class. Its annoying that SNPeff and/or VCF tools does not implement some standard approach!

In the (distant) past I have used the amusingly venerable https://www.hiv.lanl.gov/content/sequence/SNAP/help_files/README.html (also https://anaconda.org/bioconda/perl-snap ) which counts sites according to Nei-Gojobori. This is not so bad if base-composition at 3rd positions is not too skewed (etc) and divergence is not high (not a problem for within-population data!).

I have also used PAML (inc. pairwise) to get ML estimates, but this depends on a codon model in which you need to make more subtle decisions about base composition and substitution models, and what these imply for the mutational-opportunity count of sites (Paul's point). And you need to create interim trees and alignments, which is a bit silly in the age of the VCF!

My reason for asking is that
(1) It would be convenient to start with a VCF and a GFF/GTF
(2) For 4fold and 0fold I've never decided what it is best to do with codons that are segregating for more than two positions (Simon's problem)

Thanks all!

D

--

Darren Obbard
darren...@ed.ac.uk

Institute of Evolutionary Biology
University of Edinburgh
Ashworth Laboratories, Charlotte Auerbach Road
Edinburgh EH9 3FL

Office 0131 651 7781
Mobile: 07968 838 635

http://obbard.bio.ed.ac.uk/

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.

OBBARD Darren

unread,

Mar 15, 2022, 5:29:14 AM3/15/22

to evo...@lists.ed.ac.uk, ashworth-c...@googlegroups.com

[Deborah]
>4x degenerate sites are generally (and. I’d think, properly) defined at the base level

[Brian]
> Surely, a 4-fold site is a 3rd coding position where all mutations are synonymous;

Taking these two together: you can only know if all the mutations are synonymous *if* you know the codon the third position is in.

So I think fourfold sites cannot be defined at a per-base level, as you have to refer to the codon the sites appears in; i.e. you cannot say a site is fourfold if you do not know the state of the other bases in the codon.

So I think you need to know phase in order to know the state of the codon.

This gives you a problem in Sanger sequencing non-haploids, as you can't know phase without additional inference.

For Illumina you *could* in principle have phase, and thus the codon state, but that information is not present in VCFs, so again you need some additional inference

Some counting methods ignore codons with multiple SNPS in them - but this could lead to biases (since mutations and selection could be autocorrelated, especially if you are interested in K rather than pi. Or they (often) just base it on the reference sequence - but of course that is an arbitrary choice!

D
--
Darren Obbard
darren...@ed.ac.uk

Institute of Evolutionary Biology
University of Edinburgh

Room 2.09, Ashworth 2, Charlotte Auerbach Road

Edinburgh EH9 3FL

Office 0131 651 7781
Mobile: 07968 838 635

http://obbard.bio.ed.ac.uk/

-------------------------------------------------------------------

> -----Original Message-----
> From: CHARLESWORTH Brian
> Sent: 15 March 2022 09:06
> To: OBBARD Darren <darren...@ed.ac.uk>
> Subject: Re: denominators for pi_s and pi_4fold - Answer summary
>
> There seems to be some conceptual confusion here. Surely, a 4-fold site is a 3rd
> coding position where all mutations are synonymous; if you have a list of these,
> you just use these sites for analysing polymorphism data and ignore other sites
> (i.e. your denominator for pi etc is just the total number of 4-fold sites in the
> dataset). It seems that some people are doing codon-based estimates, but
> variability is normally expressed on a per site not per codon basis.
>
>
>
> On 14 Mar 2022, at 22:04, OBBARD Darren <darren...@ed.ac.uk

> darren...@ed.ac.uk <mailto:darren...@ed.ac.uk>

>
> Institute of Evolutionary Biology
> University of Edinburgh
> Ashworth Laboratories, Charlotte Auerbach Road
> Edinburgh EH9 3FL
>
> Office 0131 651 7781
> Mobile: 07968 838 635
>
> http://obbard.bio.ed.ac.uk/
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336. Is e buidheann carthannais a th’ ann an
> Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
>
>
>

> Brian Charlesworth
>
> Brian.Cha...@ed.ac.uk <mailto:Brian.Cha...@ed.ac.uk>

>
> Institute of Evolutionary Biology
> University of Edinburgh

> Charlotte Auerbach Road
> Edinburgh EH9 3FL

> UK
>
> (44)-0131-6505751 (office)
> 0131 662 1725 (home)

Reply all

Reply to author

Forward

0 new messages