Using a Burn-in to Interpolate Nucleotide Sequences Across Space


Sohan Alleshwaram

Aug 11, 2025, 10:01:55 PM
to slim-discuss
Hi everyone!

I'm currently trying to run simulations with many subpopulations, but I only have a VCF file for a few of them. Because of this, I want to interpolate the sequences for the subpopulations that lack data by running a burn-in. My idea is to hold the sequences constant in the subpopulations I have data for, letting the other subpopulations reach an equilibrium.

My issue is that I'm not sure how to hold these subpopulations constant. What I am thinking is that I can call readHaplosomesFromVCF() on each tick for each subpopulation that has genetic data in order to "reset" it.

However, this seems quite inefficient as each VCF will have to be read many times. 
Are there any ways to go about this differently that aren't so resource intensive? Also, if anyone thinks I'm thinking about the problem wrong and could be doing something else, please do tell.
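As a rough illustration of the "hold some sites constant and let the rest equilibrate" idea, here is a minimal Python sketch. It is not SLiM code and not a method from this thread: it tracks a single allele frequency per subpopulation on a hypothetical 1-D chain of sites, ignores drift entirely, and assumes symmetric nearest-neighbour migration at a made-up rate `m`. The function name `equilibrate` and all parameter values are assumptions chosen for illustration.

```python
from typing import List, Set


def equilibrate(freqs: List[float], fixed: Set[int],
                m: float = 0.1, ticks: int = 2000) -> List[float]:
    """Deterministic migration on a 1-D chain of subpopulations.

    freqs[i] is the allele frequency in subpopulation i.
    Indices in `fixed` are held constant (the sites with data);
    every other site moves toward the mean of its neighbours.
    """
    p = list(freqs)
    n = len(p)
    for _ in range(ticks):
        new = p[:]
        for i in range(n):
            if i in fixed:
                continue  # data-bearing site: never updated
            neighbours = [p[j] for j in (i - 1, i + 1) if 0 <= j < n]
            if not neighbours:
                continue  # a lone site has nobody to exchange migrants with
            # A fraction m of the gene pool is replaced by neighbour alleles.
            new[i] = (1 - m) * p[i] + m * sum(neighbours) / len(neighbours)
        p = new
    return p
```

In this toy setting the free sites settle on frequencies that lie between those of the fixed sites on either side, which is the kind of spatial interpolation described above. Whether a full individual-based model with drift and linked loci behaves similarly is a separate question that this sketch does not answer.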

Best,
Sohan Alleshwaram



Ben Haller

Aug 12, 2025, 1:57:39 AM
to Sohan Alleshwaram, slim-discuss
Hi Sohan!

I guess I'd suggest two alternatives.  (1) Create and burn in the other populations first, and then once that is done, create the VCF-based populations and load them in; or (2) In a nonWF model, create all the subpopulations at the beginning, and load the VCF-based populations; and then until the burn-in is finished, in the VCF-based subpopulations simply don't generate any new offspring (use a reproduction() callback that does nothing), and don't kill anyone.  I think either of those ought to work.

The problem, it seems to me, is that your VCF-based populations will share some genetic diversity due to common ancestry, but your burn-in populations will not.  I'm not sure what to do about that, though.  :->  I guess maybe I'd wonder: are you quite sure that it is of benefit to directly use the VCF data that you have at all?  What does it get you?  Maybe it is more important for your subpopulations to be related to each other in realistic ways.  You could use the VCF data to get FST estimates between your subpops, and estimate expected FSTs for the other subpops based on geographical distance if you have no other informative data, and then make a model that burns in all of the subpopulations with migration such that the end result of the burn-in shows FST values that fit your empirical estimates.  Maybe that would be better?  But it really depends on the research questions you're trying to address; if there is a strong reason to use the VCF sequences that you have, then I guess you should do that.  :->  Good luck, and happy modeling!
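The FST-based suggestion can be sketched in Python as follows. This is only an illustration under stated assumptions: allele frequencies per subpopulation are taken as already extracted from the VCFs, FST is computed as a simple ratio-of-averages (H_T - H_S) / H_T for two equally weighted subpopulations, and the straight-line relationship between FST and distance used in `fit_line` is an assumption made here, not something established in this thread. Both function names are hypothetical.

```python
from typing import Sequence, Tuple


def fst_two_pops(p1: Sequence[float], p2: Sequence[float]) -> float:
    """F_ST between two subpopulations from per-SNP allele frequencies.

    p1[i] and p2[i] are frequencies of the same allele at biallelic SNP i.
    Heterozygosities are summed over SNPs before taking the ratio.
    """
    if len(p1) != len(p2) or len(p1) == 0:
        raise ValueError("need two non-empty frequency lists of equal length")
    h_within = 0.0  # mean within-subpopulation heterozygosity, summed over SNPs
    h_total = 0.0   # heterozygosity of the pooled population, summed over SNPs
    for a, b in zip(p1, p2):
        h_within += (2 * a * (1 - a) + 2 * b * (1 - b)) / 2
        p_bar = (a + b) / 2
        h_total += 2 * p_bar * (1 - p_bar)
    if h_total == 0.0:
        return 0.0  # no variation at all, so no measurable differentiation
    return (h_total - h_within) / h_total


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all x values are identical")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx
```

One possible use: compute `fst_two_pops` for every pair of sequenced sites, fit the pairwise values against pairwise geographic distance with `fit_line`, and read off an expected FST for pairs involving unsequenced sites. Those expected values would then be targets to tune migration rates against during the burn-in.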

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University



Sohan Alleshwaram

Aug 12, 2025, 3:07:35 AM
to slim-discuss

Hi Ben,

Thank you for your response, I really appreciate your help. 

However, I think that you might be misunderstanding the problem I'm having. The goal of the burn-in is to make it so that all the subpopulations share genetic diversity, with the subpopulations closer to each other sharing more diversity. This is to create the starting conditions for a simulation which will be run for just a few tens of generations to see how individual mutations spread. All of the subpopulations are representative of real-life sites where we have collected data; it's just that we lack sequenced genomes for a lot of them, and because of this we want to estimate these missing genomes so we can run the simulation from a realistic starting point. Because of the small number of generations I am looking at for the actual simulation to analyze, the plan is to run the burn-in and simulation without mutations being generated by SLiM, just to see how individual pre-existing SNPs spread and move within the greater population over this short time period.

I do think you are raising some valid points though. You mentioned that the burn-in populations would have a different common ancestry which makes sense. Could a solution to that be assigning each subpopulation the spatially nearest VCF, and then running the burn-in while holding the original sites with the genetic data constant by the means I specified above ("resetting" the sites with corresponding data each generation by loading its VCF)? This could theoretically make sure each subpopulation has the same ancestry while also serving to interpolate the VCFs over the space as sites between the sites being held constant will evolve to have sequences that are "between" the sites with data.
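The "assign each subpopulation the spatially nearest VCF" step could look something like the Python sketch below. The site names, coordinates, and file names are all hypothetical, and plain Euclidean distance on planar coordinates is assumed; for latitude/longitude data a great-circle distance would be more appropriate.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float]


def nearest_vcf(all_sites: Dict[str, Coord],
                vcf_for_site: Dict[str, str]) -> Dict[str, str]:
    """Map every site name to the VCF path of the closest site that has data.

    all_sites maps site name -> (x, y); vcf_for_site maps the names of the
    sequenced sites to their VCF paths. Sequenced sites map to their own file
    as long as no other sequenced site shares their exact coordinates.
    """
    if not vcf_for_site:
        raise ValueError("at least one site must have a VCF")
    assignment = {}
    for name, xy in all_sites.items():
        closest = min(vcf_for_site,
                      key=lambda s: math.dist(xy, all_sites[s]))
        assignment[name] = vcf_for_site[closest]
    return assignment
```

The resulting mapping could be written out and used to decide which file each subpopulation loads at the start of the burn-in.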

I do think that using the VCF data I have is important for the research we are doing, though I will discuss this idea of not using the VCFs with the people I am working with and just relating the subpopulations in ways similar to what we have observed.

If I haven't explained something well or you want to ask any other questions to better answer my question, please tell me.

Thanks again for your help.

Best,
Sohan Alleshwaram

Peter Ralph

Aug 12, 2025, 6:09:06 AM
to slim-discuss, Sohan Alleshwaram
Hi, Sohan - it sounds to me like what you are wanting to do is effectively to condition the simulation on the observed sequences. However, it's also pretty much totally impossible. For instance: suppose you want to run a complex simulation conditioned on the allele frequency at a single locus. The only general way to do this is by rejection sampling: run simulations until you find one with the desired frequency at the locus. This is inefficient but possible. Now, how about conditioning on frequencies at two loci: the chance that a given simulation has the desired combination is vanishingly small. Entire sequences just won't work. A do-able approach is to set up the burn-in so that the higher-level descriptors of sequence similarity (e.g., diversity, divergence) roughly match the observed data.
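The rejection-sampling argument can be made concrete with a toy Python sketch. It assumes neutral Wright-Fisher drift at independent (unlinked) loci, all starting at frequency 0.5, and accepts a run only if every locus ends within a tolerance of a target frequency. The function names, population size, and other parameter values are arbitrary choices for illustration, not anything taken from SLiM.

```python
import random
from typing import List


def drift_final_counts(n_loci: int, n_copies: int, generations: int,
                       start_freq: float, rng: random.Random) -> List[int]:
    """Final allele counts after neutral drift at independent loci."""
    counts = []
    for _ in range(n_loci):
        count = round(start_freq * n_copies)
        for _ in range(generations):
            p = count / n_copies
            # Binomial resampling of n_copies gene copies at frequency p.
            count = sum(1 for _ in range(n_copies) if rng.random() < p)
        counts.append(count)
    return counts


def acceptance_rate(n_loci: int, target: float, tol: float, trials: int,
                    rng: random.Random, n_copies: int = 50,
                    generations: int = 10) -> float:
    """Fraction of simulations whose loci ALL end within tol of target."""
    accepted = 0
    for _ in range(trials):
        counts = drift_final_counts(n_loci, n_copies, generations, 0.5, rng)
        if all(abs(c / n_copies - target) <= tol for c in counts):
            accepted += 1
    return accepted / trials
```

Comparing, say, `acceptance_rate(1, 0.5, 0.05, 1000, random.Random(1))` with the same call for three loci shows the accepted fraction dropping sharply. With independent loci the acceptance probability is the per-locus probability raised to the number of loci, so it shrinks geometrically, and conditioning on whole sequences is out of reach.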

But, maybe I misunderstand?

Best of luck,
  Peter


Ben Haller

Aug 12, 2025, 6:44:40 AM
to Sohan Alleshwaram, slim-discuss
Hi Sohan,

Ah, I see.  I was misled by the term "burn-in", which normally does involve new mutations.  :->  But you want some period of equilibration between the subpops with genetics and the subpops that initially have no genetics, I guess.  OK.

I'm not familiar with anybody having done this sort of thing before, so as far as I know you're breaking new ground.  :->  I don't really have any idea what the best procedure might be, from the perspective of trying to generate a realistic pattern of diversity in the model.  That seems like a question that would be deserving of its own separate research project, really!

I think the approach you describe might or might not be a good approximation, depending upon what the reality of the missing data is.  If all the subpopulations are really quite similar, then perhaps it is fine.  If they have quite different environments, have been divergent for a long time, etc., then probably what you suggest will not produce something that resembles reality.  Again, it sounds like a whole research project in itself.  :->

I don't think I have more to add, beyond that and what Peter has just said.  Good luck and happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University

