Adding initial allele frequencies without recapitation

101 views
Skip to first unread message

Francesc Ganau

unread,
Aug 8, 2023, 2:36:06 PM8/8/23
to slim-d...@googlegroups.com

Hello,

I'm currently researching the fluctuations of allele frequencies during polygenic adaptation, with a focus on varying selection coefficients across different loci. My objective is for all loci to be biallelic, akin to SNPs, at the population level. Thanks to prior guidance from this mailing list, I've successfully made all sites biallelic. However, I'm now facing challenges when attempting to initiate simulations with allele frequencies similar to those that would be expected under neutrality.

To sidestep the burn-in phase, I tried using recapitation with PySLIM. But, as far as I have been able to find, PySLIM recapitation results in multiple alleles for each locus — a scenario I'd like to avoid. When I tried to retain only one mutation per site (recapitating first and removing generated mutations from the tree afterwards), I ran into a slew of issues concerning metadata and tree sequence outputs. After numerous days of unsuccessful attempts, I've decided to forgo using PySLIM for this purpose.

It's worth noting that I'm not aiming for the initial population to be in perfect mutation-drift equilibrium. Given the biallelic nature of the sites in my model, a rough approximation of the allele frequency distribution under neutrality would work well enough. This, in theory, roughly aligns with a negative exponential distribution. Here's a bit from my code wherein I tried to introduce mutations in SLiM, drawing frequencies from this distribution:

for (locus in seqLen(ChromosomeLength)) {

    derived_allele_freq = rexp(1);

    num_derived = asInteger(round(derived_allele_freq * NePop));

    num_derived = max(min(num_derived, NePop), 0);

    if (num_derived > 0) {

        derived_allele_individuals = sample(p1.individuals, num_derived);

        derived_allele_individuals.genomes.addNewMutation(mutation_type, locus);

    }

}

 

Unfortunately, the code doesn't produce the desired result, whether I use AddNewMutation or AddNewDrawnMutation. I'm seeking guidance on how to introduce these initial allele frequencies in SLiM without recapitation. It's vital for my model that every locus remains biallelic within the population, regardless of whether they are IBS or IBD. If there's a feasible way to recapitate and still ensure biallelic status for all sites, I'd welcome that solution as well.

Thank you for your unwavering support and the invaluable insights this mailing list consistently offers. It is honestly incredible.

Best regards,

Francesc

 

 

Ben Haller

unread,
Aug 8, 2023, 3:06:03 PM8/8/23
to Francesc Ganau, slim-d...@googlegroups.com
Hi Francesc,

To kick things off, before I ponder this more: you write that "the code doesn't produce the desired result".  In what way?  I.e., how exactly does it differ from the desired result, what does it do instead, and what evidence do you have that that is what it does instead?  (Sometimes the problem is not actually in the code at all, but in the analysis that follows – maybe the code is right and the analysis is wrong...?)

Also, can you please post a full, minimal model that encapsulates the code you gave, and that exhibits the problem?  This makes it much easier to reproduce the behavior, while knowing that I am not doing something different from what you did on your end.

Thanks!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Francesc Ganau wrote on 8/8/23 2:35 PM:
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/CAH8u8PrCxw49bMmB6AFfi%3DnaZajGD5GkiSR2_WzhHTS8KSpTVg%40mail.gmail.com.

Francesc Ganau

unread,
Aug 16, 2023, 4:32:43 PM8/16/23
to Ben Haller, slim-d...@googlegroups.com

Hello,

I apologize for any confusion in my earlier post. It can indeed be challenging to explain complex ideas concisely in text. In a nutshell, I work on the simulation of polygenic selection on standing variation in humans. Specifically, I am interested in understanding the dynamics of allele frequencies a based on their initial state and selection coefficients.

I think I have finally managed to make it work, but I would highly appreciate your opinion to see if there is something wrong with my approach that may distort my analysis down the line…

 

The model I aim for consists of:

·      Population Initialization: A single population is created with L independent, biallelic sites. The recombination rate is set at 0.5. Each site will essentially have one mutation at the population level — no mutation in an individual's genome represents the "ancestral" allele, while a mutation represents the "derived" allele.

·      Randomized Selection Coefficients: Mutations are assigned random selection coefficients at each independent site. The first half of the chromosome carries 'm1' mutations, which are predominantly neutral QTLs (with minor variations in s values around 0). On the other hand, 'm2' mutations have significantly larger s values.

·      Derived Allele Initialization: Derived alleles or mutations start at random frequencies. This initial distribution mirrors what we'd expect at mutation-drift equilibrium under neutrality. Given the independence of loci, there's no need for burn-in or recapitation — which, as previously discussed, posed challenges when enforcing loci to be biallelic using PySLIM. Instead, I've opted for assigning initial allele frequencies based on the anticipated distribution.

·      Frequency Allocation: Each genome site is assigned a random allele frequency from a negative exponential distribution. This isn't the actual expected distribution but serves as a placeholder for testing. For instance, if we draw a 0.4 frequency for site A, we would randomly allocate the mutation to 40% of the population's genomes at that site, considering its corresponding selection coefficient. This allocation is replicated across all L sites.

A simplified version of my attempted code looks like:

----------------------------

 initialize() {


defineConstant("NePop", 200); // An example value, adjust as needed
defineConstant("GenSplit", 2); // The of generations under neutrality before selection. For now, I don't want any, so just 2
defineConstant("GenFinal", 300); // Nº of generations to be run
defineConstant("CL", 1000); // Nº of sites involved int the polygenic phenotype
defineConstant("RecRate", 0.5); // I don't want any new mutations after the initial ones are given
defineConstant("SSd", 0.5); // The standard deviation of the fitness distribution frowm which selection coeffs are drawn
defineConstant("SMean", 1); // The mean of the selection coeff distn
defineConstant("SmallQTLFactor", 1); // The factor by which small QTLs are scaled, renamed for consistency
defineConstant("BigQTLFactor", 10); // The factor for large QTLs, renamed for consistency


defineConstant("MutRate", 1e-14); //In principle, no new mutations should appear.

// Define genomic element proportions for 2 types
defineConstant("g1end", asInteger(CL / 2));
defineConstant("g2start", g1end + 1);
defineConstant("g2end", CL);
//Defining the STD DEVS of the fitness distributions for large and small qtlS
defineConstant("SmallQTLSSd", SSd * SmallQTLFactor);
defineConstant("BigQTLSSd", SSd * BigQTLFactor);
initializeMutationType("m1", 0.5, "n", SMean, SmallQTLSSd);
initializeMutationType("m2", 0.5, "n", SMean, BigQTLSSd); // m4 is now renamed to m2
//I a mutation gets fixed I want the output VCF to reflect this rather than removing it
m1.convertToSubstitution = F;
m2.convertToSubstitution = F;

initializeGenomicElementType("g1", m1, 1);
initializeGenomicElementType("g2", m2, 1);

initializeGenomicElement(g1, 0, g1end);
initializeGenomicElement(g2, g2start, g2end);

initializeMutationRate(MutRate);
initializeRecombinationRate(RecRate);
}


fitness(m1) {return 1.0; }
fitness(m2) { return 1.0; }


//The bit of code to ensure all sites are biallelic
mutation() {
pos = sim.mutations.position;
if (any(pos == mut.position))
return F;
return T;
}

1 {

sim.addSubpop("p1", NePop);

mutation_types = c(m1, m2);

// Generate derived allele frequencies
for (locus in seqLen(CL)) {
if (locus <= g1end) {
mutation_type = m1;
} else {
mutation_type = m2;
}

//Defining the distribution from which per-site allele frequencies will be drawn.
//The "1" is a random number, this is one of the parameters I am keen on exploring.
derived_allele_freq = rexp(1);

num_derived = asInteger(round(derived_allele_freq * NePop)); // of individuals sampled
num_derived = max(min(num_derived, NePop), 0); //Ensuring the number does not exceed Ne bounds

if (num_derived > 0) {
derived_allele_individuals = sample(p1.individuals, num_derived);

//The way I found to implement this is to sample an idividual and give it a mutation. This mutation is afterwards copied to other individuals to match the desired alelee frequency.
chosen_individual = sample(derived_allele_individuals, 1);
chosen_genome_index = sample(0:1, 1);
chosen_genome = chosen_individual.genomes[chosen_genome_index];

new_mutation = chosen_genome.addNewDrawnMutation(mutation_type, locus);


for (ind in derived_allele_individuals) {
for (genome in ind.genomes) {
genome.addMutations(new_mutation);
}
}
}
}

sim.rescheduleScriptBlock(s1, start=2, end=GenSplit);
sim.rescheduleScriptBlock(s1, start=GenSplit, end=GenFinal);

}
//The numbers next to s1 and s2 when running blocks are just placeholders. I noticed it does not work with s1 and s2 alone, even if the times used are actually the ones in s1 and s2.


s1 10000 late() {
if (sim.generation % 50 == 0) {
print(sim.generation);
}
inds1 = p1.individuals;
p1_phenotypes_m1 = inds1.sumOfMutationsOfType(m1);
p1_phenotypes_m2 = inds1.sumOfMutationsOfType(m2); // Adjusted
//Calculating fitnesss
inds1.fitnessScaling = p1_phenotypes_m1 + p1_phenotypes_m2;
}

s2 1500 {
print("DONE!");
//p1.individuals.genomes.outputVCF(filePath=SLIMOutVCF_Select);
}


---------------------------

The interesting bit is in the 1 {} block.

I hope it is much clearer what I am trying to do. Given I need to run a large nº of simulations and the fact that loci are independent allows me to simplify things, this approach is probably more useful than “classic” recapitation, Yet, let me please know if you think it could be made more robust or improved to prevent future mistakes down the line.

 

Thank you so much and apologies again if you find it still unclear,

 

Francesc


Missatge de Ben Haller <bha...@mac.com> del dia dt., 8 d’ag. 2023 a les 21:04:

Ben Haller

unread,
Aug 17, 2023, 10:43:27 AM8/17/23
to Francesc Ganau, slim-d...@googlegroups.com
Hi Francesc!

If CL is the number of sites you want, as your comment says, then your chromosome positions ought to range from 0 to CL-1, not 0 to CL (which results in CL+1 sites).

SMean of 1 seems surprising.  I see later in your code that this is a quantitative genetics model, with additive effects.  Still – do you really want the mean effect to be 1.0, not 0.0?  You say that the m1 mutations are supposed to be nearly neutral, but with a mean effect of 1.0, that does not seem to me to be the case...?

Speaking of quantitative genetics models, it is unusual to calculate the additive effects of alleles and then use that directly as a fitness effect.  Normally, the sum of the additive effects is taken to be a phenotypic trait value, and then fitness is some function of phenotype.  The fitness model you've implemented here is unusual – not a normal popgen "selection coefficient" model, and also not a normal "quantitative genetics" model.  Perhaps that is what you intend, but I don't understand the biological meaning of your implementation.  :->  I would advise that you add a print() of the fitness values you are calculating, and make sure that they are what you expect them to be.  Note that there are lots of examples of quant-gen models in the manual; if you're not sure what I'm talking about, I'd advise looking at them.

You write "In principle, no new mutations should appear", but you set the mutation rate to 1e-14.  Why not just set it to 0.0?  If you don't expect/want new mutations, then just rule them out with a rate of 0.0, and then you don't have to worry about the small but non-zero probability that one has occurred, and you can remove the mutation() callback.

Your two calls to rescheduleScriptBlock() both reschedule s1, and s2 does not get rescheduled.  Is that what you intend?

I would not implement the mutation addition the way you have; it's inefficient and hard to read.  You can simply do:

   derived_allele_individuals.genomes.addNewDrawnMutation(mutation_type, locus);

That adds a new drawn mutation to all the genomes of the chosen individuals, making them homozygous (which seems to be what your code does at present).  It is much more efficient than your implementation.  If you wanted to add the mutation at a given frequency across all genomes, not across all individuals (i.e., resulting in Hardy-Weinberg equilibrium, rather than homozygous individuals and "nullizygous" individuals), you would just call sample() on p1.individuals.genomes rather than p1.individuals, adjusting for factors of 2 and such as needed.  (H-W equilibrium seems like a more natural starting point...?)

That's what I see in a quick review of your code, but of course I may have made errors, misunderstood what you're trying to do, etc.; you cannot rely upon a review like this for much.  (And frankly, I usually don't do model reviews on here at all, so please, don't everybody start posting your models here and asking me to review them for you! :->).  You need to check and validate and debug your model, regardless of what I have suggested above.  Good luck, and happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Francesc Ganau wrote on 8/16/23 4:32 PM:
Reply all
Reply to author
Forward
0 new messages