Hi Kamolphat!
Please attach the profile that you took; given the scale of the
model, you shouldn't expect us to take a profile of it ourselves.
In general, please provide the actual information that you have,
rather than just summarizing it verbally. :->
I'm puzzled that you call out three aspects of the profile, but
those three aspects barely total more than 50%, so I'd like to see
where the other ~50% is spent. You also say that for first() events
it is "probably for the life-long monogamy part", but why
"probably"? The profile should show you the exact lines of code
that are hotspots within your events; this should not be a matter of
speculation.
Is the profile taken with a smaller version of the model, or at your
desired full scale of 2 million individuals? You say that the
scaling of the model was not linear, so if you're trying to get it
to run at full scale, a profile taken at small scale will not be
informative regarding where the performance problems are. As the
number of individuals increases, some aspects of a model will
typically scale linearly (not a big problem) whereas others will
scale quadratically or worse (a big problem). So you need to figure
out what part of the model is not scaling well, and fix that. Your
profile also ought to be taken at a point when mutational density
has more or less equilibrated; profiling a model at the very start
of a burn-in period is very misleading. Ideally, you profile across
a full run of the model so that your profile captures the totality
of what is happening, because as a model runs the location of the
hotspots in it can change drastically.
You say that "the majority" of the memory usage is the tree-sequence
recording tables, and it sounds like tree-sequence simplification
isn't a major part of your runtime (you didn't mention it), so an
obvious place to start to get the memory usage down would be to
increase the frequency of simplification; did you try that?
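An easy way to experiment with that is to trigger simplification
yourself at whatever interval you like; a minimal sketch (the
interval of 100 is just a placeholder, to be tuned against the
runtime cost):

    // somewhere in the tick cycle: simplify every 100 ticks
    late() {
        if (community.tick % 100 == 0)
            sim.treeSeqSimplify();
    }

Alternatively, initializeTreeSeq() has parameters governing how often
automatic simplification happens, which you could tighten.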
Your justification for using pseudo-chromosomes instead of separate
chromosomes doesn't really make sense to me. You might try using
separate chromosomes, which might make the tree-sequence recording
information smaller since it would then not be struggling to record
the free recombination between your 34 chromosomes. With
pseudo-chromosomes, SLiM has to record all of the recombination
events between the chromosomes, for every individual it generates,
which is probably a huge amount of information and would totally
destroy the correlations between adjacent trees in the tree sequence.
I don't know for sure, but that seems likely to be bad.
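(Just so we're talking about the same thing: by "pseudo-chromosomes"
I'm assuming you mean the usual trick of one long genome with
rate-0.5 recombination between the chromosome spans, something like
this sketch with just two 1 Mb chromosomes laid end to end:

    // free recombination (rate 0.5) at the single boundary position
    rates = c(1e-8, 0.5, 1e-8);
    ends = c(999999, 1000000, 1999999);
    initializeRecombinationRate(rates, ends);

With 34 chromosomes that is 33 such boundaries, each one producing a
recorded breakpoint in about half of all gametes, every tick.)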
You say that you rescaled the genome to be 1000x smaller with a
1000x higher mutation and recombination rate. That might not end up
being a win, since you will have the same number of mutations
segregating, just at a higher density. It might actually slow SLiM
down, since it will have to deal with a lot more mutation
stacking, and since its mechanisms for taking advantage of shared
haplotype structure might not work as well. So you might try just
not doing that. :-> My guess is that it probably makes little
difference either way, given that the number of mutations remains
the same and the density is not that high even after rescaling; but
it'd be interesting to check and see. The more typical model
scaling that people do is described in section 5.5 of the SLiM
manual, but it probably doesn't work well at all for a spatial
model, and would have to be done extremely carefully in that
context, so I am not recommending it, just noting it. People don't
generally rescale the genome length in the way you're doing,
precisely because it doesn't usually end up making much difference, I
think. :->
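Just to spell out the arithmetic behind "the number of mutations
remains the same": the expected number of new mutations per gamete is
(mutation rate) x (genome length), so e.g. 1e-8 per site x 1e9 sites
= 1e-5 per site x 1e6 sites = 10 new mutations per gamete either way
(those numbers are purely illustrative, not taken from your model).
All the rescaling changes is how densely packed those mutations are,
which is exactly what drives the stacking and haplotype-sharing
issues I mentioned.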
I also note that although you've rescaled the genome to be 1000x
smaller, you still have 1 million sites in your model, all of which
are deemed to be potential QTL sites. From a nucleotide-level
perspective (QTNs) that perhaps makes sense; but you could shift to
a gene-based perspective and model just, say, 1000 sites as QTLs, if
that fits your biology. With the downscaled version of the model
that you provided, after 1000 ticks there are only 400 mutations
segregating, but that version of the model appears to have a
population size of only about 2500, so if you want to scale up to 2
million, you're going to have a *lot* of QTNs in your model. Most
of those will probably be extremely rare, but all of them will need
to be tracked by SLiM, recorded in the tree-sequence tables, etc.
Is that realistic, and do you really need that many QTNs, and that
level of genetic detail? How about a smaller number of causal
sites, with perhaps correspondingly larger effect sizes, or
something like that? Maybe you need to be simulating what
you're simulating, I don't know; but this seems worth thinking
about.
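If you did go the gene-based route, the initialize() side of it would
look something like this; everything here (the rates, the DFE, the
count of 1000, the stand-in population) is made up just to show the
shape of it, and your phenotype/fitness machinery would sit on top:

    initialize() {
        initializeMutationRate(1e-7);                        // made-up rate
        initializeMutationType("m2", 0.5, "n", 0.0, 0.1);    // made-up QTL DFE
        m2.convertToSubstitution = F;
        initializeGenomicElementType("g2", m2, 1.0);
        
        // ~1000 single-site QTLs scattered across a 1 Mb genome,
        // instead of every site being a potential QTN
        for (pos in sort(sample(0:999999, 1000)))
            initializeGenomicElement(g2, pos, pos);
        
        initializeRecombinationRate(1e-8);                   // made-up rate
    }
    
    // QTL mutations are neutral for direct fitness here; your
    // trait-based fitness calculation would replace this
    mutationEffect(m2) { return 1.0; }
    
    1 early() { sim.addSubpop("p1", 500); }                  // stand-in population
    100 late() { }                                           // stand-in end point

That caps the number of causal sites at 1000 no matter how large the
population gets.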
You define two spatial maps in your 1 first() event, but you name
them both "world", so the second map will replace the first one, I
think. Seems like a bug.
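The fix is presumably just to give the two maps distinct names, and
then fetch each one by its own name wherever it is used; something
like this (the names and value matrices here are placeholders, not
from your script):

    // two maps, two distinct names
    p1.defineSpatialMap("world", "xy", worldValues, interpolate=T);
    p1.defineSpatialMap("elevation", "xy", elevationValues, interpolate=T);
    
    // later, query whichever map you need by name
    v = p1.spatialMapValue("elevation", individual.spatialPosition);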
In your first() event you're calling i2.evaluate() after each
individual chooses a mate. That is going to be lethal,
performance-wise. You need a better mate-choice algorithm.
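The usual pattern is to evaluate the interaction once, at the top of
the tick, and then just query it per individual; something like this
sketch (your life-long monogamy bookkeeping would go where the
comment is):

    first() {
        // one evaluation for the whole tick, not one per mate choice
        i2.evaluate(p1);
        
        for (ind in p1.individuals)
        {
            // draw one candidate mate, weighted by interaction strength
            mate = i2.drawByStrength(ind, 1);
            if (size(mate) == 0)
                next;
            // ... pair ind with mate here (monogamy bookkeeping) ...
        }
    }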
Overall, for a model with 2 million individuals, two traits with
pleiotropy, 34 chromosomes, a 1000 Mb genome (rescaled, but with the
same number of mutations, and with every site potentially being
causal), spatial, running for 3000 ticks, with tree-sequence
recording but without a recapitated burn-in... it's going to take a
while, and use a lot of memory. Such is life. :-> So while you
may be able to speed the model up significantly by working with the
profile and adjusting some aspects of it, in the end, resign
yourself to starting your runs and then going and working on
something else for a while. Keep in mind that biologists who do
fieldwork often have to spend years collecting data for a single
experiment – and a lot of that is manual labor, not just letting a
computer do the work! :-> Redesigning your mate-choice
algorithm might help a lot; think of a way to design it so that it
requires only a single evaluation of the interaction per tick. And
rethinking how you're handling the genetics might help; I'm not sure
exactly what to propose, but it seems like you've got way more
causal sites than you probably need. Beyond that, it's hard to
guess further; you need to provide a high-quality profile of the
model running at full scale.
Cheers,
-B.
Benjamin C. Haller
Messer Lab
Cornell University