Burn-in nucleotide diversity

401 views
Skip to first unread message

Theo Hemmant

unread,
Jun 5, 2023, 12:38:48 PM6/5/23
to slim-discuss
Screenshot 2023-06-05 at 17.37.24.pngScreenshot 2023-06-05 at 17.37.05.png
Hi all,

Following on from a post I made a few weeks ago, I have since produced graphs that show the function of nucleotide diversity (or calcHeterozygosity()) over time. I had expected the graphs to plateau at:

pi = theta/(1+2(theta))

where pi = nucleotide diversity and theta = 4Nmu. However, this did not happen, and instead by 5N generations were plateauing at approximately theta. Additionally, when I used a high mutation rate of 10/mu, nucleotide diversity is exceeding 1, instead plateauing at 1.8.
I was wondering if anyone has seen this before, and if they know what is going on. I have attached my graphs to this post for reference. Thank you for taking the time to read this.

Kind regards,

Theo

Theo Hemmant

unread,
Jun 5, 2023, 12:49:44 PM6/5/23
to slim-discuss
A brief edit is that all values of mu are /4N for all combinations, not /N...

Miguel de Navascués

unread,
Jun 5, 2023, 1:53:55 PM6/5/23
to slim-d...@googlegroups.com
Hi Theo,

If you are simulating a constant population size, the expected value of
pi is theta. Why do you expect pi = theta/(1+2(theta))? What are you
simulating exactly?

Regarding the value of 1.8 for pi with high mutation rates (and
theta=10): my guess is that you are using some staking policy that is
preventing to reach the expected value of 10. That is, you are not using
an "infinite site" model. Without the code that you have used it is
difficult to say much more.

Best,

Miguel




On 05/06/2023 18:49, Theo Hemmant wrote:
> A brief edit is that all values of mu are /4N for all combinations, not
> /N...
> On Monday, June 5, 2023 at 5:38:48 PM UTC+1 Theo Hemmant wrote:
>
> Screenshot 2023-06-05 at 17.37.24.pngScreenshot 2023-06-05 at
> 17.37.05.png
> Hi all,
>
> Following on from a post I made a few weeks ago, I have since
> produced graphs that show the function of nucleotide diversity (or
> calcHeterozygosity()) over time. I had expected the graphs to
> plateau at:
>
> pi = theta/(1+2(theta))
>
> where pi = nucleotide diversity and theta = 4Nmu. However, this did
> not happen, and instead by 5N generations were plateauing at
> approximately theta. Additionally, when I used a high mutation rate
> of 10/mu, nucleotide diversity is exceeding 1, instead plateauing at
> 1.8.
> I was wondering if anyone has seen this before, and if they know
> what is going on. I have attached my graphs to this post for
> reference. Thank you for taking the time to read this.
>
> Kind regards,
>
> Theo
>
> --
> SLiM forward genetic simulation: http://messerlab.org/slim/
> <http://messerlab.org/slim/>
> ---
> You received this message because you are subscribed to the Google
> Groups "slim-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to slim-discuss...@googlegroups.com
> <mailto:slim-discuss...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/slim-discuss/8bbd1d94-8d2e-4b22-9fb5-eb67e4ce5452n%40googlegroups.com <https://groups.google.com/d/msgid/slim-discuss/8bbd1d94-8d2e-4b22-9fb5-eb67e4ce5452n%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Miguel de Navascués

UMR CBGP, INRAE
Centre de Biologie pour la Gestion des Populations
755 avenue du campus Agropolis
CS30016
34988 Montferrier-sur-Lez cedex (France)

miguel.navascues AT inrae.fr
@Miguel_de...@mastodon.online

https://www6.montpellier.inrae.fr/cbgp_eng/Staff/Permanent-staff/de-Navascues

Sign the PCI Manifesto for free open access to scientific publications!
https://peercommunityin.org/pci-manifesto/

Theo Hemmant

unread,
Jun 5, 2023, 6:28:46 PM6/5/23
to slim-discuss
Hi Miguel,

This equation was obtained from (Gillespie, 2004) 'Population Genetics, a concise guide'. We were using this approximation of how nucleotide diversity would behave because for the majority of our simulations we are using high values of theta (1 and 10). 
My project is trying to estimate the effective population size of Anopheles gambiae. I am using SLiM to generate populations which underwent a partial soft sweep and then apply a TMRCA clustering analysis to estimate Ne, then comparing this to what we know it to be based on the SLiM parameters. I'm following on from last year's work, where it was believed that mutation stacking was throwing off the clustering algorithm that had been created. As such, I am running long burn-ins with mutation rates on the command line as I couldn't find a way to properly get rid of stacking after overlaying mutations on PySLiM.
I've attached the code I used for each burn-in below, but I have made multiple scripts varying the parameters, so this is just one example. 
Many thanks,

Theo



initialize() {



// Use tree-sequence recording to speed up burn-in

initializeTreeSeq();


defineConstant('pop_size',100);

//defineConstant('T_N',10*pop_size);

defineConstant('ge_length',70000);

defineConstant('sweep_site',integerDiv((ge_length+1),2));

defineConstant('low_site',sweep_site-1);

defineConstant('high_site',1+sweep_site);


defineConstant('recombination_rate',0.00025);

defineConstant('mut_rate',0.00025);

//defineConstant('T_mu', 10/mut_rate); //round it

defineConstant('nucleotide_diversity', (4*pop_size*mut_rate)/(1+(2*4*pop_size*mut_rate)));

initializeRecombinationRate(recombination_rate);

initializeMutationRate(mut_rate);


// m1 is a neutral mutation

initializeMutationType('m0', 0.5, 'f', 0.0);

m0.mutationStackPolicy = "l";


initializeGenomicElementType('g1', m0, 1.0 );

initializeGenomicElement(g1, 0, low_site);

initializeGenomicElement(g1, sweep_site, sweep_site);

initializeGenomicElement(g1, high_site, ge_length);}



1 early(){

sim.addSubpop("p0", pop_size);

}


1:1000 late(){

div = calcHeterozygosity(p0.genomes);

catn(sim.cycle +","+ div);

}

//conditional termination of somulation based on val of nucleotide diversity

500:39999 late() {

div = calcHeterozygosity(p0.genomes);

if(div >= nucleotide_diversity*1.1){ // if(div >= nucleotide_diversity){

sim.simulationFinished();

catn('number of generations taken ' + sim.cycle);

sim.treeSeqOutput("burnin_no.1_early_10.trees");

}

}



// Run burn-in for ~10/mu generations

40000 late() {

// Output recorded tree-sequence

sim.treeSeqOutput("burnin_no.1._10_.trees");

}


Miguel de Navascués

unread,
Jun 5, 2023, 7:13:05 PM6/5/23
to slim-d...@googlegroups.com
Hi Theo,

I guess you are using the equation for a single site with two alleles
that mutate from one to the other with the same mutation rate (page 54).
For the infinite site model, which would be the appropriate model for
SLiM with mutation stack policy "s", you need equation (2.25) on next
page, which gives you E(pi)=theta. This also works for your model with
stacking policy "l" and low mutation rates. With high mutation rate I am
not sure you can get an off-the-shelf equation... but I do not think you
need it: If you need to do a neutral burn-in in SLiM, it is better to
follow section 17.6 of SLiM manual.

Best,

Miguel

Bhavin Khatri

unread,
Jun 5, 2023, 7:56:58 PM6/5/23
to slim-discuss
Hi Miguel, 

Thanks for your reply. I'm supervising Theo's UG project and I'm a theoretical physicist come population geneticist. The average pair-wise nucleotide diversity cannot exceed 1 (this would need pairs of sequences which have a Hamming distance of greater than L for L sites, which is clearly unphysical). The formula you quote is only true in the limit that θ=4Nμ<<1, the more general expression is obtained, for example, by integrating x(1-x) over the eqm allele frequency distribution in mutation-drift balance (p(x) ~ [x(1-x)]^(θ-1) ), which gives π=θ(1+2θ), and the correct limit of π->1/2 when mutation is very strong (θ>>1). There is a similar expression in Gillespie for a slightly different quantity and it also saturates. In any case, the calculation of the average pair-wise nucleotide diversity within SliM should not give values greater than 1 for physical sequences, and Theo has set stacking to last, so it is not clear what is going on!

As Theo said earlier it is not possible (or in practice easy) to do a neutral burn-in using pyslim, since stacking is the default behaviour, and needs a way of removing these post-hoc — hence the brute-force forward simulations for burn-in. It is absolutely necessary to explore θ~1 and greater, since this is exactly what we find empirically for Anopheles gambiae across sub-Saharan Africa (Khatri & Burt, MBE 2019).

Best, 

Bhavin

Peter Ralph

unread,
Jun 6, 2023, 1:17:31 AM6/6/23
to Bhavin Khatri, slim-discuss
Hm: looking at functionSource("calcHeterozygosity"), the calculation being done under the hood does
length = species.chromosome.lastPosition + 1;
p = genomes.mutationFrequenciesInGenomes(muts);

heterozygosity = 2 * sum(p * (1 - p)) / length;

Consider the case with a single site at which there are n mutations, each at frequency 1/n. Each contributes 2 * 1/n * (1-1/n) to the sum, resulting in a heterozygosity of 2*(1-1/n). I started to file a bug report for this, before reading in the documentation

> The implementation of calcHeterozygosity(), viewable with functionSource(), treats every mutation as independent in the heterozygosity calculations.  One could regard this choice as embodying an infinite-sites interpretation of the segregating mutations.

So, in fact, this is correct as defined, although in an extremely weird way. Effectively, it is computing the mean number of mutations that differ between a randomly chosen pair of sequences, divided by the length of the genome; since with "last" mutation stacking, any two sequences of length L can *each* have L different mutations, they will at maximum differ at 2*L mutations, thus saturating at a heterozygosity of 2.


I have not opened an issue on this, since the right way forward is not currently clear to me: doing the calculation that gets what we expect is substantially less efficient (which IIRC was a motivation in defining it this way). Note that if you use calcPairHeterozygosity() then there is an option that lets you get what you want, which the documentation points to. If someone has a concrete proposal, then they should open an issue. (and, maybe this warrants a note in the documentation, although "last" is not a common stacking policy)

Also note that if you calculate heterozygosity from the tree sequence, you'll get what you expect.

On another note:

> it is not possible (or in practice easy) to do a neutral burn-in using pyslim, since stacking is the default behaviour, and needs a way of removing these post-hoc

This could be done in not many lines of python - if you'd like to do it, open an issue on pyslim?

** peter


--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/11dcb3fd-52c0-4f5b-ac0e-0082a867be08n%40googlegroups.com.

Bhavin Khatri

unread,
Jun 6, 2023, 6:05:27 AM6/6/23
to slim-discuss
Hi Peter, 

Thanks for those clarifications and pointers. I definitely think it is worth highlighting some of these points, to make clear in the documentation, for which regimes of θ different functions are appropriate to empirically calculate π — but understand that most of SliM is geared to θ<<1, where these problems/differences are less important. 

I don't personally use SliM or pyslim, so can't comment on specifics, but maybe Theo can open an issue on pyslim.

Cheers

Bhavin

Chase W. Nelson

unread,
Nov 16, 2023, 3:06:21 AM11/16/23
to slim-discuss
Peter and all, thanks so much for this discussion! I'm hoping to verify understanding of calcHeterozygosity(). To do this, I constructed a toy example of 3 sequences, named A, B, and C, limiting to biallelism. I'm showing two ways I think I would calculate the 'empirical' value of π (nucleotide diversity), and then how I think SLiM's function does it (expected heterozygosity). Would you be willing to confirm whether this understanding is in/correct?

pi calculation.png

Additionally, I note that I receive an error when trying to examine the source in the Eidos Console (SLiM version 4.0.1; Eidos version 3.0.1):

> functionSource("calcHeterozygosity")
No function found for "calcHeterozygosity".

Thanks!
Chase

Peter Ralph

unread,
Nov 16, 2023, 9:11:16 AM11/16/23
to Chase W. Nelson, slim-discuss
Ah ha - I think what you're showing there is a different, um, feature of the calculation than what we were discussing above. Recall that heterozygosity is "proportion of differences between two randomly chosen sequences". Your first two calculations (that get pi=0.27) do the calculation without​ replacement (ie, between two randomly chosen and different​ sequences). The third calculation (that gets pi=0.18) is with​ replacement. If you're doing this with N sequences, the "with replacement" value will be smaller by a factor of 1-1/N (here, by 2/3).  One might refer to these as the "sample heterozygosity" and the "population heterozygosity"? In general I prefer the "without replacement" version - it is unbiased, for instance - but in big samples, the difference will be tiny.

From: slim-d...@googlegroups.com <slim-d...@googlegroups.com> on behalf of Chase W. Nelson <cwnel...@gmail.com>
Sent: Thursday, November 16, 2023 12:06 AM
To: slim-discuss <slim-d...@googlegroups.com>
Subject: Re: Burn-in nucleotide diversity
 

Ben Haller

unread,
Nov 16, 2023, 9:42:38 AM11/16/23
to Chase W. Nelson, slim-discuss
Hi Chase!

Re: that console error, that's odd; perhaps you had a typo?  Seems to work fine for me.  Anyhow, here are the results of that query, for the current GitHub head SLiM (I don't recall this function changing since 4.0.1 but I could be wrong):

> functionSource("calcHeterozygosity")

(float$)calcHeterozygosity(object<Genome> genomes, [No<Mutation> muts = NULL], [Ni$ start = NULL], [Ni$ end = NULL]) <SLiM>

{

if (genomes.length() == 0)

stop("ERROR (calcHeterozygosity()): genomes must be non-empty.");

if (community.allSpecies.length() > 1)

{

species = unique(genomes.individual.subpopulation.species, preserveOrder=F);

if (species.length() != 1)

stop("ERROR (calcHeterozygosity()): genomes must all belong to the same species.");

if (!isNULL(muts))

if (!all(muts.mutationType.species == species))

stop("ERROR (calcHeterozygosity()): muts must all belong to the same species as genomes.");

}

else

{

species = community.allSpecies;

}

length = species.chromosome.lastPosition + 1;

// handle windowing

if (!isNULL(start) & !isNULL(end))

{

if (start > end)

stop("ERROR (calcHeterozygosity()): start must be less than or equal to end.");

if (isNULL(muts))

muts = species.mutations;

mpos = muts.position;

muts = muts[(mpos >= start) & (mpos <= end)];

length = end - start + 1;

}

else if (!isNULL(start) | !isNULL(end))

{

stop("ERROR (calcHeterozygosity()): start and end must both be NULL or both be non-NULL.");

}

// do the calculation

p = genomes.mutationFrequenciesInGenomes(muts);

heterozygosity = 2 * sum(p * (1 - p)) / length;

return heterozygosity;

}


If you do think there's a bug with functionSource() on your machine, please file an issue on GitHub with details, but I'd be quite surprised!

Regarding your main question: I think I'm going to let the doc for calcHeterozygosity(), and the Eidos code for it, stand on their own.  The doc was written quite carefully, and I suspect any attempt by me to reword/embellish upon it would only muddy the waters.  :->  But perhaps Peter or Philipp or someone can confirm your understanding.  I find the jargon surrounding these calculations, and the many ways of performing them, quite confusing.

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University



Chase W. Nelson wrote on 11/16/23 3:06 AM:

Chase W. Nelson

unread,
Nov 17, 2023, 1:50:36 AM11/17/23
to slim-discuss
Peter and Ben, this makes perfect sense! Thanks a ton for the crystal clear explanation. Regarding the console, the exact same command — copy & pasted! — works today when it hadn't yesterday. I have absolutely no idea. Totally understood on leaving the docs as they are, the last lines of the source make the calculation clear (sweet use of the more efficient ∑2pq rather than 1 - ∑p^2).

I do wonder if a nucleotideDiversity() will be considered in the future!? :-> hard to make it efficient, but I'll raise a GitHub idea!

Appreciate your help so much!
Chase
Reply all
Reply to author
Forward
0 new messages