Soft sweep simulation too slow

139 views
Skip to first unread message

Anushka

unread,
Mar 23, 2022, 5:39:28 PM3/23/22
to slim-discuss
Hello,

I am very new to SLiM and was hoping that someone could help me speed up a script I've written. 

I am trying to simulate a soft sweep in which a neutral mutation becomes slightly beneficial. The aim of this is to then cluster the haplotypes on which the beneficial mutation is found in order to predict the number of origins of the soft sweep and thus the effective population size. 

To do this, I have split my code into two parts. In the first part, I simulate a population of 10,000 individuals for 100,000 generations. There are two mutation types - m1 and m2 - and initially both are neutral. Whilst m1 can occur at any position in the genome, m2 only occurs at a single locus in the genome. In the second part, I simulate the soft sweep by changing the selection coefficient of m2 to be slightly beneficial (i.e. 0.01).

The issue with the first part of this code is that it runs slowly on my computer (32 GB RAM) - it takes about 20 mins to run the first 1000 generations. I initially wanted to write a nucleotide level model but that took even longer.mIs there a way I can change the code to make it run faster? Also, I intend to run this code on my college's HPC and was wondering if increasing the number of cores SLiM uses would help speed it up?

Here is the code for both parts:

PART 1 - evolve population for 100,000 generations

initialize() {

defineConstant('simID', getSeed());

defineConstant('pop_size',10000);

defineConstant('ge_length',99999);

defineConstant('sweep_dom',0.45);

defineConstant('recombination_rate',1e-3);

defineConstant('mut_rate',1/(4*pop_size));

initializeRecombinationRate(recombination_rate);

initializeMutationRate(mut_rate);

 

// Create two neutral mutations. m2 will be the sweep mutation

initializeMutationType('m1', 0.5, 'f', 0.0);

initializeMutationType('m2', sweep_dom, 'f', 0.0);

m2.convertToSubstitution = F; // (is this needed?)

 

// Initialise genomic element with resistance mutation at position 50000

initializeGenomicElementType('g1', m1, 1.0 );

initializeGenomicElementType('g2', c(m1,m2), c(1.0,1.0) );

initializeGenomicElement(g1, 0, 49999);

initializeGenomicElement(g2, 50000, 50000);

initializeGenomicElement(g1, 50001, ge_length); }


1 {

sim.addSubpop('p0', pop_size);

p0.tag = 0;

start = clock(); }

 

 

2:99000 late() {

if (sim.generation % 500 == 0) {

catn(sim.generation);}

if (sim.generation % 3000 == 0){

catn("Elapsed: " + (clock() - start));}

}

 

100000 late() {

sim.outputFull(filePath = "slim_" + simID + ".txt");

sim.simulationFinished();

}

PART 2: Soft sweep

initialize() {

defineConstant('simID', getSeed());

defineConstant('pop_size',10000);

defineConstant('ge_length',99999);

defineConstant('sweep_dom',0.45);

defineConstant('recombination_rate',1e-3);

defineConstant('mut_rate',1/(4*pop_size));

initializeRecombinationRate(recombination_rate);

initializeMutationRate(mut_rate);

 

 

// Create two neutral mutations. m2 will be the sweep mutation

initializeMutationType('m1', 0.5, 'f', 0.0);

initializeMutationType('m2', sweep_dom, 'f', 0.0);

m2.convertToSubstitution = F; // (is this needed?)

 

// Initialise genomic element with resistance mutation at position 50000

initializeGenomicElementType('g1', m1, 1.0 );

initializeGenomicElementType('g2', c(m1,m2), c(1.0,1.0) );

initializeGenomicElement(g1, 0, 49999);

initializeGenomicElement(g2, 50000, 50000);

initializeGenomicElement(g1, 50001, ge_length); }

 

1 late () {

sim.readFromPopulationFile("~/Desktop/slim_" + '2353137633341' + ".txt");

p0.tag = 0;

}

 

2:5000 late() {

// Change selection coefficient of all m2 mutations to 0.01

    sweep = sim.mutationsOfType(m2);

    sweep.setSelectionCoeff(0.01);

    if (p0.tag != sim.countOfMutationsOfType(m2)) {

        if (any(sim.substitutions.mutationType == m2)) {

            catn('HARD sweep ended in gen. ' + sim.generation);

            sim.simulationFinished();}

        else {

            p0.tag = sim.countOfMutationsOfType(m2);

            if (sim.generation % 100 == 0){

                catn('Gen. ' + sim.generation + ': ' + p0.tag + ' lineage(s)');

                catn('m2 frequency across all lineages: ' + sum(sim.mutationFrequencies(p0, sim.mutationsOfType(m2))));}

 

        if ((p0.tag == 0) & (sim.generation > 3000)) {

            catn('Sweep failed to establish');

            sim.simulationFinished();}}}

       if (all(p0.genomes.countOfMutationsOfType(m2) > 0)) {

           catn('SOFT sweep ended in gen. ' + sim.generation);

           catn('Frequencies: ');

           print(sim.mutationFrequencies(p0, sim.mutationsOfType(m2))); // prints mutation frequency for each lineage

           catn(p0.tag + 'lineages') ;

           sample = p0.sampleIndividuals(300).genomes;

           sample.outputVCF("soft_sweep_300_1.vcf");

           sim.simulationFinished();}} 

I would appreciate any guidance or advice.


Warm regards,

Anushka

Ben Haller

unread,
Mar 24, 2022, 9:20:11 PM3/24/22
to slim-discuss
Hi Anushka, sorry for the slow reply.  So, if I understand correctly, the first of the two models you posted is the problem.  It looks like it is a neutral burn-in of:

- 100,000 generations – pretty long
- a mutation rate of 2.5e-05 – quite high!
- a recombination rate of 1e-3 – extremely high!
- a population size of 1e4 (middling)
- a chromosome length of 1e5 (middling)

Unsurprisingly, it spends pretty much all of its time simulating the neutral mutations. After 200 generations, you've already got almost 500,000 mutations segregating, and the model is nowhere near equilibrium yet. With that recombination rate, all those mutations are getting shuffled around and recombined a tremendous amount, too. And so generating offspring is very slow, and 100,000 generations of doing that naturally takes a long time. :->

One question is whether you really want those parameter values. With mutation and recombination rates of 1e-8 (closer to typical empirical values, for many systems, I guess?), the model runs orders of magnitude faster. But let's assume you really do want those values.

In that case, you probably need to look at "tree sequence recording", which lets you run a simulation without including any of the neutral mutations, and overlay them onto the ancestry tree afterwards. See section 1.7 and chapter 17 for more information; in particular, section 17.1 and 17.2 may be of interest. Your recombination rate of 1e-3 may make tree-sequence recording fairly slow, too, however; indeed, it might not be a win at all, since with a recombination rate that high it has to record a huge number of "edges" in the tree sequence representing recombination events. You might therefore want to switch to using the coalescent to do your neutral burn-in, adding in the desired non-neutral diversity to the ancestry tree, and then load the resulting tree sequence into SLiM for your non-neutral simulation. The pyslim documentation has an example, https://tskit.dev/pyslim/docs/latest/vignette_coalescent_diversity.html, that I think does close to exactly what you want to do, and it might be fast enough for your purposes.  If you have questions about that, feel free to ask again on this list.  Good luck!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University

Ben Haller

unread,
Mar 24, 2022, 9:56:02 PM3/24/22
to slim-discuss
Oh, and – no, SLiM uses one core for any given run, it is not multithreaded.  Making it capable of running multithreaded is a planned project that may come to fruition at some point, but not soon; for software as complex as SLiM, it is a rather complex undertaking.  But if you follow that pyslim vignette and use msprime to do your burn-in, it will likely get to the point where it is acceptably fast; I certainly hope so (and it would be interesting to hear a report back on how that goes for you :->).  If even the coalescent in msprime is not fast enough for your purposes, then you really will probably need to reconsider that recombination rate of 1e-3.  :->

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Anushka

unread,
Mar 28, 2022, 6:34:01 AM3/28/22
to slim-discuss
Hello Benjamin,

Thank you very much for your suggestions, I will try lowering the mutation and recombination rates first and see if that helps.

Warm regards,
Anushka

Anushka

unread,
Apr 11, 2022, 11:27:10 AM4/11/22
to slim-discuss
Hello Benjamin,

I followed your suggestions and used a lower recombination rate (1e-6) and tree sequence recording to perform the neutral burn-in. This part of my code (see below) now runs in seconds, instead of hours! Thank you for pointing me in the right direction!

1_burn_in.png
After this, I use pyslim to annotate the tree and set the selection coefficients of all the mutations to 0.

2_annotate.png

Then, I read the annotated tree into slim and simulate a soft sweep of the mutation m2:

4_soft_sweep.png
3_soft_sweep.png

After the soft sweep occurs, I take a sample of individuals and save them in vcf format. However, when I use allel.GenotypeArray to visualise the vcf file, the number of rows (579473) is much larger than the length of the genome in the simulation (40000):

Screenshot 2022-04-11 at 19.19.08.png

I know I've put up a lot of code but I would be really grateful if you could suggest why I'm having this issue.

Warm regards,
Anushka

Peter Ralph

unread,
Apr 11, 2022, 11:56:55 AM4/11/22
to Anushka, slim-discuss
> After the soft sweep occurs, I take a sample of individuals and save them in vcf format. However, when I use allel.GenotypeArray to visualise the vcf file, the number of rows (579473) is much larger than the length of the genome in the simulation (40000):
 
Hi, Anushka! This is because it looks like there's a bunch of mutations at every position in the genome. Let's see - with a population of size 1000 and a mutation rate of 5e-4, we expect about 4 Ne mu = 2 mutations per site (neutrally). The way SLiM records mutations, these will each be separate rows in the output VCF file - see section 26.2.3 in the manual for explanation here.


4_soft_sweep.png
3_soft_sweep.png

 

Anushka

unread,
Apr 11, 2022, 1:39:25 PM4/11/22
to slim-discuss
Hello Peter,

Thank you very much for your reply. I still don't understand why, given that I set the stacking policy to 'l', one individual (e.g. column 498 in the matrix below) can have two different alleles (alleles 0 and 2) at the same position (position 1). Could you please explain that?

Warm regards,
Anushka

 Screenshot 2022-04-11 at 21.34.30.png

Peter Ralph

unread,
Apr 12, 2022, 9:43:03 AM4/12/22
to Anushka, slim-discuss

Thank you very much for your reply. I still don't understand why, given that I set the stacking policy to 'l', one individual (e.g. column 498 in the matrix below) can have two different alleles (alleles 0 and 2) at the same position (position 1). Could you please explain that?


Ah, right - well, I think that's because you've also added mutations using msprime, which only supports the default stacking model of SLiM mutations.

We should probably move additional questions off-list, to avoid filling everyone's inbox, btw.

-- Peter

Ben Haller

unread,
Apr 12, 2022, 12:10:50 PM4/12/22
to Peter Ralph, Anushka, slim-discuss
One more on-list reply for closure; follow up off-list as Peter suggested, please.

So, to resolve the issue you're having, you could (a) use a lower mutation rate to get fewer stacked mutations perhaps, (b) filter out sites with stacked mutations (but with your current mutation rate that might filter out almost all sites!), or (c) post-process the VCF data to determine the final mutation at each site, considering the most recent mutation in any stack to be the only mutation present.  Perhaps (c) is your best option if you need the high mutation rate.  It would be a custom script, in R or Python or C or whatever, that you would need to write, but perhaps it would not be too terribly difficult to get working.  So then Python would still be overlaying the mutations with stacking, but you would essentially re-interpret the final data as if stacking were in effect.  If necessary, you could also possibly do this in Python immediately after mutation overlay, fixing the actual tree sequence, but I have no idea how you'd do that or whether it would be difficult.  Anyhow, if you do write such a script, I hope you will share it on slim-discuss or with a pull request to SLiM-Extras.

I guess that brings us to option (d): dive into the Python code for msprime/pyslim/tskit and add a new mutation stacking policy feature to it, parallel to how it works in SLiM.  If you're up for that, perhaps Peter would have helpful advice regarding where to start.  All of this stuff is open-source software, and is a community effort; you're welcome to join the community!  :->


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Peter Ralph wrote on 4/12/22 9:42 AM:
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/CAO0xWMD5aDT4o3y7e_8W8-HNAgJrunLWYpYVckLSn9HVyZE7sA%40mail.gmail.com.

Reply all
Reply to author
Forward
0 new messages