Microsatellite simulation using a transition matrix

15 views
Skip to first unread message

Luis Fernández

unread,
Apr 25, 2024, 7:01:52 PMApr 25
to slim-discuss
Hi Ben and Peter

I've been using Slim to simulate the mutational dynamic followed by microsatellites. Instead of a simple increase or decrease by one-step unit, I've implemented a transition matrix. This matrix defines the probabilities for each microsatellite to increase or decrease by a certain number of steps, which could be larger than one, or to remain in the same allelic state. To determine these probabilities, I sample from a weight vector that reflects the likelihood of each microsatellite being in the ancestral state, considering that the ancestral states are unknown.

My simulations involve modeling the demographic history of the human-pan group, which includes humans, chimpanzees, and bonobos. Given the large population sizes and generation times, each branch of the simulation is executed in parallel.

I'm currently concerned about the computational resources required to run these simulations, particularly the time and memory usage. Is it possible that there are opportunities to optimize my code for efficiency or to explore alternative approaches such as incorporating a rescaling factor?

Looking for insights into the implementation of the transition matrix method as well.

This is my slim script and the parameters I'm using.

simulation_human_pan.slim
parameters.make

Ben Haller

unread,
Apr 26, 2024, 4:58:13 AMApr 26
to Luis Fernández, slim-discuss
Hi Luis!

I wasn't able to run your model, since the .csv files required weren't provided, so I just looked at its code a bit.

I don't really understand the modifyChild() code – it's rather complex!  But overall, I'd have a couple of comments:

- If you want to make a model run faster, step one is to produce a profile of the model, which you can do either in SLiMgui (easier but less accurate) or at the command line (more complicated but more accurate); see sections 21.5 and 21.7, and 21.6 regarding profiling memory usage as well.  Start with a profile so you don't waste time optimizing code that isn't even a performance bottleneck.

- If the modifyChild() callback does turn out to be a bottleneck, you'll be able to see which lines are the hotspots.  Then look for shared calculations that are done many times (once per microsat, per call to the callback, for example) and could be done just once.  If there are things you could precalculate, that would be helpful.

- Overall, this design with the modifyChild() callback that loops over each microsat in each genome, one by one, doesn't benefit from any vectorization at all, so it probably spends a lot of time in script (but again, profiling should be used first, to confirm that guess, before doing any optimization work).  You could switch to a more vectorized design and perhaps get a substantial speedup.  For example, you could handle all of the microsat mutations in a late() event, in bulk.  Perhaps you would find all microsats with a repeat count of N, and process all of them in the same way (since I guess their transition probabilities would all be the same).  This would allow calls like "sample(probabilities, 1, weights=roots)" to become vectorized, making many draws from the same weighted distribution instead of doing those draws one at a time, which would provide a potentially large speed advantage.  Obviously this is a major redesign and the resulting code will be more complex than the simple loop-based design you presently have; whether it's worth the effort is up to you.  :->

- Similarly, you do the work of uniquing mutations and making new mutations one at a time, with a ton of repeated work.  For every microsat in every genome in every tick, you are calling "sim.mutationsOfType(m2)" to get the microsat mutations again, for just one example.  One useful technique is that nowadays SLiM allows you to keep a permanent reference to a mutation object in a global constant (see section 9.9, for example).  You might consider a model design in which you create a permanent microsat mutation with each possible count, for each position, at the start of the model run.  (You'd do this in a "dummy" subpopulation that you would then kill, before the real simulation begins.)  You could then have, for example, a global constant that is a matrix (row, column) of all of the possible microsat mutations, indexed by position (row) and repeat count (column), or some such.  Many possibilities for this sort of thing, which would probably have the potential to make the code more efficient – again, at the price of additional work and code complexity, as is usually the case with optimization work.

There's no easy solution here.  Profile, find hotspots, refactor your code to allow greater vectorization, eliminate duplicated computations of the same things, lather, rinse, repeat.  Optimization work can take up as much time as you're willing to devote to it, pretty much.  :->

As for rescaling, section 5.5 of the SLiM manual has a pretty extensive discussion.  It is far from a panacea, and can introduce serious artifacts into the behavior of a model, so it should be used with extreme caution.  If profiling indicates that your model has good potential for optimization (i.e., is spending most of its time in your script, not in offspring generation and fitness calculation), that is probably the better path.  See also Dabi & Schrider (2024), presently in bioRxiv I think, for some very useful further discussion of rescaling artifacts.  But if the profile shows the modifyChild() work to be minor (which seems possible, given that it is mostly only done for microsats that actually mutate, and the mutation rate is only 0.0001), then probably most of the runtime is inside the SLiM core doing reproduction and fitness calculation, in which case rescaling might be your only option.

Good luck, and happy modeling!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Luis Fernández wrote on 4/25/24 7:01 PM:
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/54d8955c-32e5-46e3-a816-3d369f3f63den%40googlegroups.com.

Reply all
Reply to author
Forward
0 new messages