Using msprime to simulate initial populations with multiple chromosome in SLiM 5.0

47 views
Skip to first unread message

Kamolphat Atsawawaranunt

unread,
May 18, 2025, 7:22:58 PMMay 18
to slim-discuss
Hi Ben,

I am not sure if this is the right place for this question, but I hope that you can point me in the correct direction.

I was hoping to first generate standing genetic diversity using msprime and then use this to start a SLiM simulation in SLiM 5.0 with multiple chromosomes (i.e., use the initializeChromosome() function).

For SLiM 4.3, the simulated chromosomes (one long chromosome with altered recombination rates) from msprime 1.3.3 seems to work. However, how do I achieve this using msprime for SLiM 5.0 if I would like to use initializeChromosome()?

I noticed that there has been some changes in the way treesequence file is stored, so I have tried to mimic this by storing the tree sequence file from msprime for each chromosome (e.g., chromosome_1.trees, chromosome_2.trees) in a folder with a ".trees" suffix (e.g., test.trees). However, when I tried to read these into SLiM, I get the following error:

ERROR (Species::ReadTreeSequenceMetadata): the version of this file appears to be too old to be read, or the file is corrupted; you can try using pyslim to bring an old file version forward to the current version, or generate a new file with the current version of SLiM or pyslim.

I understand that I can still use the old method for SLiM 5.0, but I thought that it would be nice to transition to the newer function. Are there significant computation efficiency gains through the use of initializeChromosome() over the old methods of altering recombination rates?

Yours sincerely,
A

Ben Haller

unread,
May 18, 2025, 7:37:37 PMMay 18
to Kamolphat Atsawawaranunt, slim-discuss
Hi A!

You're on the cutting edge!  Your trees archive, that you've assembled by hand with msprime .trees files, does not have the metadata that is currently required.  You can read about those metadata requirements in chapter 29 of the manual.  You could also generate a trees archive from a SLiM simulation that is configured the way you want, and examine the metadata produced by SLiM from that simulation, and if it matches your msprime simulation closely enough, you could probably even copy that metadata over.  :-O

Normally pyslim would be the tool you'd use to annotate msprime-produced simulations with the appropriate metadata for SLiM.  Unfortunately pyslim is a little behind at the moment, though; the pyslim 1.1b1 version that Peter recently released works with single-chromosome simulations in SLiM 5, but the complete multi-chromosome support needed to do what you're trying to do is not yet there, as I understand it.  Peter is plugging away at it, so it should be along before too terribly long.

So the best thing might just be to be patient and wait for pyslim 1.1, which should make this easy.  But if you're impatient, you can (a) continue with the SLiM 4.3 strategy of modeling "effective chromosomes" with r=0.5 points, and transition over to real chromosomes once the tools are ready (I would expect the performance to be very similar on the SLiM side; on the msprime side, I don't know...), or (b) try to put the correct metadata in yourself, which might just be a matter of copying it over from a saved trees archive from SLiM that is configured identically to your simulation.

I hope this makes sense.  Great to see someone already trying out the new multi-chromosome features; I'm sorry they're not quite there yet.  Happy modeling!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Kamolphat Atsawawaranunt wrote on 5/18/25 7:22 PM:
A --
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/856e5882-68e6-4818-a257-bfd975756f92n%40googlegroups.com.

Peter Ralph

unread,
May 18, 2025, 10:01:01 PMMay 18
to Kamolphat Atsawawaranunt, Ben Haller, slim-discuss
Well gee, I thought it'd be a *little* longer before someone was wanting to do this. I'll be working on it slowly over the next few weeks. 

If you are interested in doing it yourself, here's a very rough outline: the main obstacle to assembling this yourself is that the node tables need to be identical; so, you'd append all the non-sample nodes from one tree sequence to the other one, and re-number references to it in that one, tidying up other bookkeeping. Feel free to open a discussion on pyslim if you want to work through doing it.

-peter

From: 'Ben Haller' via slim-discuss <slim-d...@googlegroups.com>
Sent: Sunday, May 18, 2025 4:37 PM
To: Kamolphat Atsawawaranunt <a.kam...@gmail.com>
Cc: slim-discuss <slim-d...@googlegroups.com>
Subject: Re: Using msprime to simulate initial populations with multiple chromosome in SLiM 5.0
 

Kamolphat Atsawawaranunt

unread,
May 18, 2025, 10:39:48 PMMay 18
to slim-discuss
Hi Ben and Peter,

Thank you so much for your reply, and thank you so much for SLiM and pyslim. Peter, I will work on the workarounds so there is no rush or need for this at all. I just thought that it may have been something that was already out there that I am unaware of. I don't think I am skilled enough to manipulate the node tables or the metadata of the tree sequence, so I will patiently wait for future updates or follow the SLiM 4.3 strategy. I thought it was neat to simulate generate standing genetic diversity with msprime, assuring coalescence so I have looked at this. However, my models are actually quite small and I was not intending to run multiple replicates so it is very achievable with just a long(er) SLiM burn-in (if I really want to keep the multiple-chromosome functionality).

Yours sincerely,
A
Message has been deleted

Kamolphat Atsawawaranunt

unread,
May 21, 2025, 8:59:02 PMMay 21
to slim-discuss
Hi Ben,
I was not sure if I should start a new thread or not, but I thought that it is somewhat related so I have posted it here.

I was trying the new initializeChromosome function. However, I noticed significantly slower speeds and more memory usage when running the simulation with initializeChromosome() instead of the old strategy of modelling effective chromosomes. For the exact code I ran (see below), using initializeChromosome() was more than three times slower and required more than three times memory than using the old strategy of modelling effective chromosomes. Am I doing something wrong here?

Example code of when I used intializeChromosome() strategy (took me 40.513 CPU s, 103.9 MB memory):

initialize() {

// Initialize nonWF model

initializeSLiMModelType("nonWF"); // Non-WF model

// Define parameters

defineConstant("gen", 500); // number of generations

defineConstant("chrom_len", 5e4);

defineConstant("C_num_QTN", 10); // number of chromosomes with selection (for later part of the simulation)

defineConstant("C_num_Nut", 10); // number of chromosomes with only neutral SNPs

defineConstant("C_num", C_num_QTN + C_num_Nut); // number of chromosomes in total

defineConstant("C_lengths", rep(chrom_len, C_num)); // lengths for each chromosome

defineConstant("R", 1e-5); // recombination rate

defineConstant("MU_base", 1e-7); // base mutation rate overall

defineConstant("N", 4000); // Total size

// Initialize other parameters

initializeSex("A");

initializeMutationType("m1", 0.5, "f", 0.0);

m1.convertToSubstitution = T;


initializeGenomicElementType("g1", m1, 1.0);

initializeGenomicElementType("g2", m1, 1.0); // Initialise separate genomic element type for now, QTL to be added

// Initialize chromosomes

for (i in seq(1, C_num_QTN)){

initializeChromosome(i, chrom_len);

initializeGenomicElement(g1, 0, chrom_len-1);

initializeMutationRate(MU_base);

initializeRecombinationRate(R);

}

for (i in seq(C_num_QTN + 1, C_num)){

initializeChromosome(i, chrom_len);

initializeGenomicElement(g2, 0, chrom_len-1);

initializeMutationRate(MU_base);

initializeRecombinationRate(R);

}

defineConstant("K", 500); // carrying capacity


}


1:gen reproduction() {

subpops=sim.subpopulations;

for (pop in subpops){

// parents are chosen proportional to fitness

female=which(pop.individuals.sex=="F");

male=which(pop.individuals.sex=="M");

parents1 = sample(pop.individuals[female], N, replace=T);

parents2 = sample(pop.individuals[male], N, replace=T);

for (i in seqLen(N)){

pop.addCrossed(parents1[i], parents2[i]);

}

self.active = 0; // call the reproduction callback only once

}

}


1:gen survival() {

return (individual.age == 0);

}


1 early() {

// Add initial panmictic pops p0

sim.addSubpop("p0", N);

}



Example code of what I effective chromosomes strategy (took me 9.384 CPU s, 31.3 MB memory):

initialize() {

// Initialize nonWF model

initializeSLiMModelType("nonWF"); // Non-WF model

// Define parameters

defineConstant("gen", 500); // number of generations

defineConstant("chrom_len", 5e4);

defineConstant("C_num_QTN", 10); // number of chromosomes with selection (for later part of the simulation)

defineConstant("C_num_Nut", 10); // number of chromosomes with only neutral SNPs

defineConstant("C_num", C_num_QTN + C_num_Nut); // number of chromosomes in total

defineConstant("C_lengths", rep( chrom_len , C_num)); // lengths for each chromosome

defineConstant("R", 1e-5); // recombination rate

defineConstant("MU_base", 1e-7); // base mutation rate overall

defineConstant("N", 4000); // Total size

// Initialize other parameters

initializeMutationRate(MU_base);

initializeSex("A");

initializeMutationType("m1", 0.5, "f", 0.0);

m1.convertToSubstitution = T;


initializeGenomicElementType("g1", m1, 1.0);

initializeGenomicElementType("g2", m1, 1.0); // Initialise separate genomic element type for now, QTL to be added

// Initialize genome

defineConstant("g2_start", sum(C_lengths[0:(C_num_QTN - 1)]));

defineConstant("gen_len", sum(C_lengths[0:(C_num - 1)]) - 1);

initializeGenomicElement(g1, 0, g2_start-1);

initializeGenomicElement(g2, g2_start, gen_len-1);

// C_num chromosomes of 50 kb each

rates = c(rep(c(R, 0.5), C_num-1), R);

ends = repEach(cumSum(C_lengths), 2);

ends = ends[0:(length(ends) - 2)];

ends = ends - c(rep(c(1,0), C_num-1), 1);

initializeRecombinationRate(rates, ends);

defineConstant("K", 500); // carrying capacity


}


1:gen reproduction() {

subpops=sim.subpopulations;

for (pop in subpops){

// parents are chosen proportional to fitness

female=which(pop.individuals.sex=="F");

male=which(pop.individuals.sex=="M");

parents1 = sample(pop.individuals[female], N, replace=T);

parents2 = sample(pop.individuals[male], N, replace=T);

for (i in seqLen(N)){

pop.addCrossed(parents1[i], parents2[i]);

}

self.active = 0; // call the reproduction callback only once

}

}


1:gen survival() {

return (individual.age == 0);

}


1 early() {

// Add initial panmictic pops p0

sim.addSubpop("p0", N);

}


Yours sincerely,
A

Ben Haller

unread,
May 21, 2025, 9:56:09 PMMay 21
to Kamolphat Atsawawaranunt, slim-discuss
Hi A!

OK, complicated.  I don't think there's anything wrong with your model, at least not that I've noticed so far.  What's happening is, I think, a bit complicated.  Basically, your model is positioned at an unfavorable spot in the parameter space.  SLiM tends to be tuned to make more biologically realistic models run faster, and to make bigger models run faster.  If a model is biologically unrealistic, I don't focus on it as much in my optimization work, because I figure people won't want to run such models anyway; and if a model is small enough to run fairly quickly, I don't worry about it as much since it's already reasonably fast.  I focus primarily on making the really big models run faster, because that's where people really hit the wall in terms of the runtime of the model limiting what they're able to simulate.

The key parameter here, I think, is the chromosome length, 5e4.  I've aimed at making longer chromosome lengths, which are more biologically realistic for most organisms, and which have longer runtimes, run as fast as possible.  If you kick the chromosome length up to 5e6, the multi-chromosome version of the model actually runs *faster*, in my quick tests just now; 304.7 seconds versus 349.5 seconds for the single-chromosome version.  Of course you'd have to run lots of replicates, etc., to get a firm result; I suspect this result would hold when averaged across many runs of the model (I saw similar gains in many tests I ran during development), but I haven't tried it with your model.  The point is, SLiM 5 with multiple chromosomes can be *faster* for large, biologically realistic models than SLiM 4 was, and than a single-chromosome model would be, because that's where I focused my optimization efforts; the underlying architecture does have some significant new optimizations that were not present in SLiM 4.

But interestingly, making the chromosome length *shorter* than your model also makes the multi-chromosome version perform better!  If I make it 5e1 (i.e., 50), the multi-chromosome version runs in 2.53 and the single-chromosome version in 2.14 (for one test run); so the multi-chrom version is slower, but by much less wide of a margin than you observed.  What's going on there?

That's less a result of deliberate optimization, and more just luck.  I *think* what is happening with your model has to do with "mutation runs", a genetic structure used under the hood to capture haplotype structure and leverage it to make runs faster – basically, SLiM can detect that haplotype structure exists, and reduce the amount of computation needed by doing some calculations just once, for a given haplotype, and reusing that computation for all occurrences of the haplotype across the model.  I *think* the reason that your chromosome length of 5e4 performs so much better as a single-chromosome model is perhaps that, at that length scale, with the particular mutation rate and recombination rate you're using, haplotype structure actually extends across more than one chromosome to a useful degree – calculations can be shared, not just within a single chromosome, but across multiple "effective chromosomes" in the single-chromosome version of the model.  That is not possible with SLiM's architecture, in the multi-chromosome version of the model; each chromosome is tracking haplotype structure independently.  So the single-chromosome version of the model gets a much bigger boost from that optimization.  And that happens particularly around the length that you chose, of 5e4; the effect of it is much less pronounced both at longer chromosome lengths (where the odds of a haplotype usefully spanning more than one chromosome diminish) and at shorter lengths (where the complexity of those haplotype-based calculations probably makes SLiM a little bit slower than it would be if it just ignored haplotypes and claculated everything by brute force).

That's just a guess.  The internals of SLiM are complex enough that things like this are difficult to analyze, and depend upon parameter values as well as your particular hardware architecture, operating system platform, etc.  Things like the size of the CPU cache can make a big difference to how well a particular model performs.  It's really hard to make a solid prediction.  But the take-home point, I guess, is: if performance is really important to you, then try writing your model in several different ways, and see which performs the best.  But you have to be careful – one version might perform the best on your own work machine, but a different version might perform the best on the computing cluster you use for your production runs!  Also, you have to be careful – one version might perform the best with a chromosome length of 5e4, but a different version might perform the best with a length of 5e6!  There's not really anything to do but experiment and measure, and you have to do that with the final parameterization of the model, on the final hardware you intend to run on, otherwise it might not be worth much.  Performance optimization is a complex topic.  :->


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Kamolphat Atsawawaranunt wrote on 5/21/25 7:49 PM:
Reply all
Reply to author
Forward
0 new messages