Non-neutral spatial simulation

Kamolphat Atsawawaranunt

Dec 12, 2025, 12:11:02 AM
to slim-discuss
Dear all,

I am trying to run a large non-neutral spatial simulation. After extrapolating from runs with fewer individuals (up to 20,000), it appears that the full model (about 2 million individuals) will likely take too long (more than two weeks; the scaling was not linear, so the estimate is rough) and require a lot of memory (~350 GB).

The main aim is to simulate native-range populations locally adapted to two environmental variables, before introducing individuals to new spatial environments, with the final goal of investigating how quantitative trait nucleotides (QTNs) may sort themselves in new environments. The environment and the (scaled) genome mimic what I know of the species' biology.

I only simulate non-neutral mutations (QTLs) on a 1 Mb genome (34 "pseudo-chromosomes" of varying size and recombination rate) and plan to recapitate and overlay neutral mutations in pyslim. Because of this, I have tree-sequence recording turned on and use pseudo-chromosomes instead of separate chromosomes.

The main features of the simulation:
  • The genome is scaled relative to the "observed" values: the genome is 1000x smaller, with recombination and mutation rates 1000x higher (see the sketch after this list)
  • Life-long monogamy
  • Territoriality (mated individuals do not move)
  • QTLs are pleiotropic, carrying two effect sizes, one linked to each environmental variable
  • First 1000 ticks: environmental values of zero, with weak stabilising selection, to generate standing genetic variation
  • Ticks 1001-2000: environmental values shift slowly from zero to the observed values, with stronger stabilising selection
  • Ticks 2001-3000: the observed environmental values.
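
As a minimal illustration of that scaling (a sketch only; the genome length and rates here are placeholders, not the values from the attached script):

initialize() {
    SCALE = 1000;                                  // genome 1000x smaller
    L = asInteger(1e9 / SCALE);                    // ~1 Gb observed -> 1 Mb simulated
    initializeMutationType("m2", 0.5, "f", 0.0);   // QTNs; effects computed in the model's own fitness code
    initializeGenomicElementType("g1", m2, 1.0);
    initializeGenomicElement(g1, 0, L - 1);
    initializeMutationRate(1e-8 * SCALE);          // per-base rates 1000x higher
    initializeRecombinationRate(1e-8 * SCALE);
}
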
Please find the SLiM script and the CSV file of recombination rates attached. The CSV file was generated on a Windows computer, so it might need dos2unix to work on a Unix machine. The SLiM script was modified to make it more easily reproducible, with randomly generated environmental maps, but it has a similar scale to the real run.

Profiling in SLiMgui suggests that the majority of the runtime goes to first() events (~35%, probably the life-long monogamy code), early() events (~12%, QTL fitness calculation), and reproduction/offspring generation (~10%), and that the majority of the memory usage goes to the tree-sequence tables.

After some testing, the amount of memory used appears to scale linearly with both the number of ticks run and the number of individuals in the simulation.

I tried making the genome smaller still (increasing the mutation and recombination rates by the same factor to compensate), but this does not seem to reduce the memory usage.

Does anyone have suggestions on how this could be made more efficient? I do not actually need tree-sequence recording, but without it I would have to track many neutral mutations in the simulation itself.

I scaled the number of individuals in the simulation by increasing the grid-cell size (res_scale) over which density effects are calculated (which likely creates some artifactual spatial structure at this point) and decreasing the dispersal distance (disp_dd) and the mate-choice interaction distance (SIGMA_M) by the same factor.

Kamolphat (A)
Predicted_recombination_to_simulate.csv
NativeRangeSpatial_reproduce.slim

Ben Haller

Dec 12, 2025, 9:27:33 AM
to slim-d...@googlegroups.com
Hi Kamolphat!

Please attach the profile that you took; given the scale of the model, you shouldn't expect us to take a profile of it ourselves.  In general, provide the actual information that you have, rather than trying to just verbally summarize it.  :->

I'm puzzled that you call out three aspects of the profile, but those three aspects barely total more than 50%, so I'd like to see where the other ~50% is spent.  You also say that for first() events it is "probably for the life-long monogamy part", but why "probably"?  The profile should show you the exact lines of code that are hotspots within your events; this should not be a matter of speculation.

Is the profile taken with a smaller version of the model, or at your desired full scale of 2 million individuals?  You say that the scaling of the model was not linear, so if you're trying to get it to run at full scale, a profile taken at small scale will not be informative regarding where the performance problems are.  As the number of individuals increases, some aspects of a model will typically scale linearly (not a big problem) whereas others will scale quadratically or worse (a big problem).  So you need to figure out what part of the model is not scaling well, and fix that.  Your profile also ought to be taken at a point when mutational density has more or less equilibrated; profiling a model at the very start of a burn-in period is very misleading.  Ideally, you profile across a full run of the model so that your profile captures the totality of what is happening, because as a model runs the location of the hotspots in it can change drastically.

You say that "the majority" of the memory usage is the tree-sequence recording tables, and it sounds like tree-sequence simplification isn't a major part of your runtime (you didn't mention it), so an obvious place to start to get the memory usage down would be to increase the frequency of simplification; did you try that?

Your justification for using pseudo-chromosomes instead of separate chromosomes doesn't really make sense to me.  You might try using separate chromosomes, which might make the tree-sequence recording information smaller since it would then not be struggling to record the free recombination between your 34 chromosomes.  With pseudo-chromosomes it is having to record all of the recombinations between the chromosomes, for every individual it generates, which is probably a huge amount of information and would totally destroy the correlations between adjacent trees in the tree sequence.  I don't know, but that seems probably bad.
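
If you try separate chromosomes, the initialization would look roughly like this (SLiM 5 multichromosome syntax; the lengths and rates here are placeholders, and your real model would pull its per-chromosome lengths and rates from the CSV):

initialize() {
    initializeTreeSeq();
    initializeMutationType("m2", 0.5, "f", 0.0);   // QTNs
    initializeGenomicElementType("g1", m2, 1.0);
    for (i in 1:34) {
        initializeChromosome(i, 30000);            // one real chromosome each
        initializeGenomicElement(g1, 0, 30000 - 1);
        initializeMutationRate(1e-5);
        initializeRecombinationRate(1e-5);
    }
}
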

You say that you rescaled the genome to be 1000x smaller with a 1000x higher mutation and recombination rate.  That might not end up being a win, since you will have the same number of mutations segregating, just at a higher density.  It might actually slow SLiM down, in fact, since it will have to deal with a lot more mutation stacking, and since its mechanisms for taking advantage of shared haplotype structure might not work as well.  So you might try just not doing that.  :->  My guess is that it probably makes little difference either way, given that the number of mutations remains the same and the density is not that high even after rescaling; but it'd be interesting to check and see.  The more typical model scaling that people do is described in section 5.5 of the SLiM manual, but it probably doesn't work well at all for a spatial model, and would have to be done extremely carefully in that context, so I am not recommending it, just noting it.  People don't generally rescale the genome length in the way you're doing precisely because it doesn't generally end up making much difference, I think.  :->

I also note that although you've rescaled the genome to be 1000x smaller, you still have 1 million sites in your model, all of which are deemed to be potential QTL sites.  From a nucleotide-level perspective (QTNs) perhaps that makes sense; but you could shift to a gene-based perspective and model just perhaps 1000 sites as QTLs, if that would make sense.  With the downscaled version of the model that you provided, after 1000 ticks there are only 400 mutations segregating, but that version of the model appears to have a population size of only about 2500, so if you want to scale up to 2 million, you're going to have a *lot* of QTNs in your model.  Most of those will probably be extremely rare, but all of them will need to be tracked by SLiM, recorded in the tree-sequence tables, etc.  Is that realistic, and do you really need that many QTNs, and that level of genetic detail?  How about maybe a smaller number of causal sites, with perhaps a correspondingly larger distribution of effect sizes, or something like that?  Maybe you need to be simulating what you're simulating, I don't know; but this seems worth thinking about.
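
For example, a sketch of the gene-level alternative (positions drawn at random here; a real model might place them according to annotation):

initialize() {
    initializeMutationType("m2", 0.5, "f", 0.0);   // QTNs; effects assigned by the model
    initializeGenomicElementType("g1", m2, 1.0);
    positions = sort(sample(0:999999, 1000));      // ~1000 distinct causal sites
    for (pos in positions)
        initializeGenomicElement(g1, pos, pos);    // single-base QTL elements
    initializeMutationRate(1e-5);                  // mutations arise only inside elements
    initializeRecombinationRate(1e-5);
}
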

You define two spatial maps in your 1 first() event, but you name them both "world", so the second map will replace the first one, I think.  Seems like a bug.
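
Something like this, with distinct names (the map names and value matrices here are placeholders):

1 first() {
    // distinct names, so the second map does not replace the first
    p1.defineSpatialMap("env1", "xy", env1_values, interpolate=T);
    p1.defineSpatialMap("env2", "xy", env2_values, interpolate=T);
}
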

In your first() event you're calling i2.evaluate() after each individual chooses a mate.  That is going to be lethal, performance-wise.  You need a better mate-choice algorithm.
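
The key is to evaluate the interaction once per tick and then draw mates from that evaluated state; a sketch (i2 is your mate-choice interaction; the tagL0 bookkeeping is illustrative):

first() {
    // evaluate the mate-choice interaction ONCE per tick, not per individual
    i2.evaluate(p1);

    // assumes tagL0 is initialized to F for new individuals; T marks "paired"
    unmated = p1.individuals[p1.individuals.tagL0 == F];

    for (ind in unmated)
    {
        mate = i2.drawByStrength(ind, 1);   // draw a mate from the evaluated state
        if (mate.size() == 0)
            next;                           // no interacting neighbor in range
        if (mate.tagL0)
            next;                           // candidate was paired earlier in this loop
        ind.tagL0 = T;                      // record the life-long pairing; a real model
        mate.tagL0 = T;                     // would also store the pair and restrict sexes
    }
}
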

Overall, for a model with 2 million individuals, two traits with pleiotropy, 34 chromosomes, a 1000 MB genome (scaled, but with the same number of mutations, and with every site potentially being causal), spatial, running for 3000 ticks, with tree-sequence recording but without a recapitated burn-in... it's going to take a while, and use a lot of memory.  Such is life.  :-> So while you may be able to speed the model up significantly by working with the profile and adjusting some aspects of it, in the end, resign yourself to starting your runs and then going and working on something else for a while.  Keep in mind that biologists who do fieldwork often have to spend years collecting data for a single experiment – and a lot of that is manual labor, not just letting a computer do the work!  :->  Redesigning your mate-choice algorithm might help a lot; think of a way to design it that only requires a single evaluation of the interaction per tick.  And rethinking how you're handling the genetics might help; I'm not sure exactly what to propose, but it seems like you've got way more causal sites than you probably need.  Beyond that, it's hard to guess further; you need to provide a high-quality profile of the model running at full scale.

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University
SLiM forward genetic simulation: http://messerlab.org/slim/
Before posting, please read http://benhaller.com/slim/bugs_and_questions.html

Peter Ralph

Dec 12, 2025, 12:46:07 PM
to slim-d...@googlegroups.com, Ben Haller
Small note: you say the scaling is not linear; that makes me suspect that part of the problem might be the issues around scaling up spatial models that are described, and solved, in https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.71098
But you should confirm this by profiling.

Kamolphat Atsawawaranunt

Dec 14, 2025, 11:36:52 PM
to Peter Ralph, slim-d...@googlegroups.com, Ben Haller
Dear Ben and Peter,

Thank you so much for your reply and suggestions, and for pointing out the bug in my code. I apologise for not attaching the profile I took. For future reference, is it best to copy the profile and attach it as a text file, or to send a screenshot? Hopefully I will be able to make better diagnoses from the profile report and avoid this kind of speculation.

You are completely right that I took the profile on a smaller run with 2500 individuals, and the evaluation of the mate-choice interaction is indeed the biggest bottleneck line of code in the first() event. Given all your suggestions, I think I will step back and rethink the genetics required to answer my question, and likely the mate-choice algorithm as well.

I also profiled the rescaled chromosome size/mutation rate/recombination rate setup against the non-rescaled one, and you are totally right that the rescaling did not improve the simulation time at all.

Regarding the separate-chromosomes model, is there any information out there on how to recapitate and overlay neutral mutations onto tree-sequence files from multi-chromosome SLiM models?

Yours sincerely,
A (Kamolphat)

Ben Haller

Dec 15, 2025, 5:00 AM
to slim-d...@googlegroups.com
Hi A,
>
> Thank you so much for your reply and suggestions, and for pointing out
> the bug in my code. I apologise for not attaching the profile I took.
> For future reference, is it best to copy the profile and attach it as
> a text file, or to send a screenshot? Hopefully I will be able to make
> better diagnoses from the profile report and avoid this kind of
> speculation.

To your question: that will likely depend on how you are replying –
whether you answer on the web or by email, and on your email client. Just
make sure that the colors in the profile report come through (which they
would not if you reply in plain text).

> You are completely right that I took the profile on a smaller run with
> 2500 individuals, and the evaluation of the mate-choice interaction is
> indeed the biggest bottleneck line of code in the first() event. Given
> all your suggestions, I think I will step back and rethink the genetics
> required to answer my question, and likely the mate-choice algorithm as
> well.
>
> I also profiled the rescaled chromosome size/mutation rate/recombination
> rate setup against the non-rescaled one, and you are totally right that
> the rescaling did not improve the simulation time at all.
>
> Regarding the separate-chromosomes model, is there any information out
> there on how to recapitate and overlay neutral mutations onto
> tree-sequence files from multi-chromosome SLiM models?

At present, the suggested approach is to simply loop over the .trees
files in the archive generated by the multi-chromosome SLiM model and
recapitate and overlay each one separately.  The recapitated period
will therefore not follow identical pedigrees across all of the
chromosomes, but that is perhaps OK since those dynamics are neutral
anyway; the introduced bias will perhaps be small, especially if the
recapitation period has no demography (i.e., is panmictic).
(Maybe Peter or someone else can speak to the population genetics of
that better than I can.)  There is an example at
https://doi.org/10.47248/hpgg2505040006 that illustrates the basic
approach.
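
A minimal sketch of that loop (the filenames, Ne, and rates here are hypothetical):

import glob
import msprime, pyslim, tskit

for path in glob.glob("sim_chr*.trees"):   # one .trees file per chromosome
    ts = tskit.load(path)
    # recapitate this chromosome's pre-SLiM history independently
    rts = pyslim.recapitate(ts, ancestral_Ne=10000, recombination_rate=1e-8)
    # overlay neutral mutations, keeping the SLiM-recorded QTN mutations
    mts = msprime.sim_mutations(rts, rate=1e-8,
                                model=msprime.SLiMMutationModel(type=0),
                                keep=True)
    mts.dump(path.replace(".trees", "_recap.trees"))

I hope that helps; happy modeling!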