Tree-sequence recording, rescheduling, and burn-in past coalescence

128 views
Skip to first unread message

Chase W. Nelson

unread,
Mar 31, 2021, 4:09:59 AM3/31/21
to slim-discuss
Greetings Ben and all! I'm hoping for guidance with a few questions; apologies in advance for the veritable firehose, and if I have missed explanations elsewhere.

Tree-sequence recording
  1. Section 24.13 (p. 576) of the manual (version 3.6) discusses the "permanent" flag of treeSeqRememberIndividuals(), but I am having trouble understanding why one would use it. Specifically, permanent=F will "mark the individuals only for (temporary) retention" — does that mean they will be removed again upon the next simplification? If so, why would one ever use the flag? I had naively thought that the only point of using the function treeSeqRememberIndividuals() was to remember individuals across simplification. As a start, the manual does suggest that what happens to non-permanently remembered individuals depends on the retainCoalescentOnly flag, but this seems only to influence whether unary (intermediate) nodes are kept in addition to branch points. So, I think I'm confused about the purpose of "permanent". (I recognize that even my question lacks clarity!)
  2. I also do not understand the following: "if a genome is kept by simplification, the genome's corresponding individual is kept also, if it is retained." Doesn't a genome being kept by simplification mean that the individual in which it occurred will be kept, by very definition? In other words, what is the meaning of the final "if it is retained"? (It seems it could be no other way?)
  3. Can the simplificationInterval be CHANGED at some point during the simulation? Similarly, can checkCoalescence be CHANGED from T to F at some point during the simulation? I would like to check for coalescence during burn-in only, but then make the simplificationInterval longer and disable checkCoalescence after coalescence is first detected, because I want to avoid the associated performance hits.
Rescheduling so as never to occur again
In my SLiM model, I have defined a named script block that checks for coalescence every 100 generations: 

/// CHECK FOR COALESCENCE ///
s1 1: late() {
    if(sim.generation % 100 == 0) {
        if(sim.treeSeqCoalesced()) {
            // record when coalescence happened
            defineConstant("t_coalescence", asInteger(sim.generation));
            catn(t_coalescence + ": COALESCED");
            
            // RESCHEDULE THIS block so it's never run again
            sim.rescheduleScriptBlock(s1, start = 1, end = 1); // ERROR: isn't allowed
            
            // RESCHEDULE ALL SUBSEQUENT blocks relative to coalescence time
        }
    }
}

However, after coalescence is detected, I don't want to check for it anymore. As shown, I was thinking to reschedule s1 (the block itself) for generation 1 (the past), which would guarantee it's never encountered again, but this produces an error:

ERROR (SLiMSim::CheckScheduling): event/callback scheduled for a past generation would not run.

What is the proper way to achieve the desired "discontinuing" of a script block?


Coalescence and burn-in time
I would like to run a forward-time burn-in (necessary because I'm using trinucleotide context-based mutation rates, which cannot yet be overlaid using msprime). I'd like the burn-in time to depend on when coalescence occurs for a given replicate; however, as the manual notes, the moment of coalescence does not necessary mean equilibrium has been reached. Thus, my question is how many generations might folks advise continuing the burn-in AFTER coalescence? Is there any agreed-upon rule of thumb (e.g., 2N, 10N, or something else)? Would be very grateful to know "best practices" in this regard!

With gratitude,
Chase

Dr Yan Wong

unread,
Mar 31, 2021, 5:24:10 AM3/31/21
to slim-discuss, Chase W. Nelson
Hi Chase - since I was largely responsible for your points 1. and 2. below, I’ll attempt a brief explanation.You might want to look at the pyslim docs here too:


Perhaps some of the confusion is to do with your point 2. In particular, bear in mind that genomes and individuals are completely different things in tree-sequence world. In a tree sequence you can have an individual with no known genome (a bit pointless) or, more to the point, you can have a genome with no associated individual: a “free-floating genome” if you will. This second case is, in fact, the norm for genomes in the past. If you run a normal SLiM simulation with tree sequence recording, you’ll find that all the necessary (coalescent ancestral) *genomes* in the past exist in the tree sequence, but are free-floating and *not* associated with an individual. So to answer your last question, it’s perfectly possible, in fact normal, to keep a genome via simplification but to throw away the individual in which the genome was residing at the time. This can turn out to be a bit of a pain because there’s some information about individuals that we sometimes might like to know, which isn’t a property of the genome (e.g. spatial location). The treeseqRememberIndividuals is to do with keeping individuals together with their associated genomes.

The “permanent=T” flag marks the genomes (nodes) of an individual as *samples*. That means at the end of a simulation, the individuals and their genomes will always be present in the tree sequence, even if they are not alive in the final generation and may not have contributed to the current day population. Think of them as genomes sequences from ancient permafrost. We (potentially) know their entire genome sequence, they are present in all the trees in the tree sequence, and we basically treat them just like an individual alive today (only they're dead). They are *never* removed by simplification.

If you set permanent=F, then although both the individual and its genome(s) are kept for the time being, they are not marked as “samples” (so for instance, you’ll find that they are not necessarily represented in all the trees in the tree sequence). Because they aren’t samples, they can be removed by simplification - in fact the genome (and the individual in which it sits) will only be kept if both (a) it contributes to the genomes that *are* marked as samples (usually that’s the current day population) AND (b) it represent a coalescent point in some tree in the tree sequence.

There are cases where you might want to keep these genomes and individuals for reason (a) even if (b) is not true (for instance, you want to “census” an entire generation of ancestors at a single timepoint). That’s when you might want retainCoalescentOnly=F.

This is new functionality, and it sounds like some of the docs could do with clarification, so do suggest improvements if you can see any.

Cheers

Yan

Ben Haller

unread,
Mar 31, 2021, 11:01:25 AM3/31/21
to slim-discuss
Hi Chase!  I think Yan took care of your points 1 & 2 (thanks Yan).  I'll try to address the rest of your questions.

You can't modify the simplificationInterval later, no.  If you want to do this sort of thing, just disable auto-simplification and call treeSeqSimplify() yourself, on whatever schedule you want.  You also can't modify the checkCoalescence flag later; the overhead is not large (at least in my experience), so if you want it turned on through your burn-in, you might as well just leave it on.  But if this proves to be a big performance issue, I could certainly add API to turn it off; just trying not to overcomplicate things unnecessarily.

If you want to reschedule a script block so that it never runs again, that's what the deregisterScriptBlock() method of SLiMSim is for.

I don't think there is any rule of thumb for how long to run after coalescence in order to get beyond the moment of coalescence (which is special and not representative of equilibrium) to a more unbiased equilibrium point, no.  I'm not even going to venture a guess, since I don't know enough about coalescent theory etc.; but perhaps someone else would like to?

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University

Peter Ralph

unread,
Mar 31, 2021, 11:02:15 AM3/31/21
to Chase W. Nelson, slim-discuss
> What is the proper way to achieve the desired "discontinuing" of a script block?

I believe this is sim.deregisterScriptBlock()

> I would like to run a forward-time burn-in (necessary because I'm using trinucleotide context-based mutation rates, which cannot yet be overlaid using msprime). I'd like the burn-in time to depend on when coalescence occurs for a given replicate; however, as the manual notes, the moment of coalescence does not necessary mean equilibrium has been reached. Thus, my question is how many generations might folks advise continuing the burn-in AFTER coalescence? Is there any agreed-upon rule of thumb (e.g., 2N, 10N, or something else)? Would be very grateful to know "best practices" in this regard!

2N seems totally fine to me.

Miguel Navascués

unread,
Apr 1, 2021, 5:25:36 AM4/1/21
to slim-d...@googlegroups.com
Hi,

Regarding the question of reaching equilibrium it seems to me that 2N is
unnecessarily long. Unless I am misunderstanding the problem, I think
that "few" generations should be ok. The problem of taking the tree
sequence at the same generation the simulation reaches full coalescence
is that this tree sequence will be shorter than expected under the
coalescent model. Think about simulating your coalescent time forward in
time (but with a coalescent model), your take your waiting times between
coalescent events from an exponential with mean 4N/(k(k-1)) where k-1 is
the number of terminal branches and you start with k=2, when you reach
the last necessary coalescent event (k=2N, first coalescent in backwards
time) you still need to simulate the waiting time from that coalescent
event to the sampling time. Since the "sample size" in the case of the
SLiM simulation is the size of the population, waiting time between
sample time and first coalescent is basically 1 generation. Thus, I
would say that there is no need for simulating 2N generations. Am I
missing something?

That is for a neutral model. If the model has selection on it things
could be trickier. For instance, if the simulation model includes some
rate of slightly deleterious mutations at the beginning of the
simulation there will be no mutations affecting the shape of the
genealogy. By the time the simulation reaches coalescence a significant
number deleterious mutations will be present in the simulation and they
will be affecting how the genealogy is generated. In that case I think
an additional longer period of additional simulations is required to
reach some sort of steady state.

I think in many cases you may want to empirically verify that you are in
a steady state by calculating key measures (diversity, etc) after
coalescence has been reached.

I will be happy to hear other opinions about what is necessary to "reach
equilibrium" in SLiM.

Best,

Miguel
> *_Tree-sequence recording_*
>
> 1. Section 24.13 (p. 576) of the manual (version 3.6) discusses the
> "*permanent*" flag of *treeSeqRememberIndividuals()*, but I am
> having trouble understanding why one would use it. Specifically,
> *permanent=F* will "mark the individuals only for (temporary)
> retention" — does that mean they will be removed again upon the
> next simplification? If so, why would one ever use the flag? I
> had naively thought that the only point of using the function
> *treeSeqRememberIndividuals() *was to remember individuals
> across simplification. As a start, the manual does suggest that
> what happens to non-permanently remembered individuals depends
> on the *retainCoalescentOnly* flag, but this seems only to
> influence whether unary (intermediate) nodes are kept in
> addition to branch points. So, I think I'm confused about the
> purpose of "*permanent*". (I recognize that even my question
> lacks clarity!)
> 2. I also do not understand the following: "if a genome is kept by
> simplification, the genome's corresponding individual is kept
> also, /if/ it is retained." Doesn't a genome being kept by
> simplification mean that the individual in which it occurred
> will be kept, by very definition? In other words, what is the
> meaning of the final "/if/ it is retained"? (It seems it could
> be no other way?)
> 3. Can the *simplificationInterval* be CHANGED at some point during
> the simulation? Similarly, can *checkCoalescence* be CHANGED
> from *T *to*F* at some point during the simulation? I would like
> to check for coalescence during burn-in only, but then make the
> *simplificationInterval* longer and disable *checkCoalescence*
> after coalescence is first detected, because I want to avoid the
> associated performance hits.
>
> *_Rescheduling so as never to occur again_**_
> _*
> In my SLiM model, I have defined a named script block that checks
> for coalescence every 100 generations:
>
> /// CHECK FOR COALESCENCE ///
> s1 1: late() {
>     if(sim.generation % 100 == 0) {
>         if(sim.treeSeqCoalesced()) {
>             // record when coalescence happened
>             defineConstant("t_coalescence",
> asInteger(sim.generation));
>             catn(t_coalescence + ": COALESCED");
>
>             // RESCHEDULE THIS block so it's never run again
>             sim.rescheduleScriptBlock(s1, start = 1, end = 1); //
> ERROR: isn't allowed
>
>             // RESCHEDULE ALL SUBSEQUENT blocks relative to
> coalescence time
>         }
>     }
> }
>
> However, after coalescence is detected, I don't want to check for it
> anymore. As shown, I was thinking to reschedule *s1* (the block
> itself) for generation 1 (the past), which would guarantee it's
> never encountered again, but this produces an error:
>
> ERROR (SLiMSim::CheckScheduling): event/callback scheduled for a
> past generation would not run.
>
> What is the proper way to achieve the desired "discontinuing" of a
> script block?
>
>
> *_Coalescence and burn-in time_*
> I would like to run a forward-time burn-in (necessary because I'm
> using trinucleotide context-based mutation rates, which cannot yet
> be overlaid using *msprime*). I'd like the burn-in time to depend on
> when coalescence occurs for a given replicate; however, as the
> manual notes, the moment of coalescence does not necessary mean
> equilibrium has been reached. Thus, my question is *how many
> generations might folks advise continuing the burn-in AFTER
> coalescence*? Is there any agreed-upon rule of thumb (e.g., 2N, 10N,
> or something else)? Would be very grateful to know "best practices"
> in this regard!
>
> With gratitude,
> Chase
>
> --
> SLiM forward genetic simulation: http://messerlab.org/slim/
> <http://messerlab.org/slim/>
> ---
> You received this message because you are subscribed to the Google
> Groups "slim-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to slim-discuss...@googlegroups.com
> <mailto:slim-discuss...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/slim-discuss/2e9c6ceb-07d4-4338-b379-e52b68bd9c98n%40googlegroups.com
> <https://groups.google.com/d/msgid/slim-discuss/2e9c6ceb-07d4-4338-b379-e52b68bd9c98n%40googlegroups.com?utm_medium=email&utm_source=footer>.


--
Miguel NAVASCUÉS

UMR CBGP, INRAE
Centre de Biologie pour la Gestion des Populations
755 avenue du campus Agropolis
CS30016
34988 Montferrier-sur-Lez cedex (France)

phone: +33499623370
fax: +33499623345
e-mail: miguel.navascues AT inrae.fr

Peter Ralph

unread,
Apr 1, 2021, 10:06:58 AM4/1/21
to Miguel Navascués, slim-discuss
> Regarding the question of reaching equilibrium it seems to me that 2N is
> unnecessarily long.

I would like to agree with you! But: consider the case of no
recombination, so there's a single tree. The total height of the tree
(regardless of the number of samples) is of order 2N, with standard
deviation also of order 2N. Let TMRCA(t) be the time until most recent
common ancestor of everyone alive at time t, and imagine plotting
TMRCA(t) against t. This will go up linearly and then drop down
suddenly, in a sawtooth pattern; from these statistics we know that
the range over which it varies is of order 2N, so the time between
jumps is of order 2N. Now imagine starting a forwards simulation at
some point along this trajectory. The time at which the history we
have recorded contains the MRCA is the time that the TMRCA first jumps
down, which is going to be at the very lowest end of typical TMRCA. To
wait for equilibrium really we ought to wait long enough to go over a
few more sawtooths; I suspect 2N is good enough in practice.

--p

Chase W. Nelson

unread,
Apr 1, 2021, 1:06:49 PM4/1/21
to slim-discuss
A THOUSAND thank-yous, all, for this swift and incredible help! So as not to waste anyone's time, I will need to process this information more fully before responding properly, but already things have become MUCH clearer!

With gratitude,
Chase

Chase W. Nelson

unread,
Apr 6, 2021, 12:10:44 PM4/6/21
to slim-discuss
Dear Yan, Ben, Peter, and Miguel: 

Thank you again for this wonderful help!

As a follow-up, my purpose in using tree-sequence recording is to track local ancestry. My simulation splits several populations and lets them evolve separately for very long periods of time (i.e., "speciate"). My first goal is to determine the age and population of the MRCA at each site in the genome. My second goal is to determine, at the end of the simulation and for a given pair of extant populations, which shared polymorphisms that are identical by state (e.g., a C>T SNP at the same site in both populations) are due to (a) recurrent mutation vs. (b) already segregating in the ancestral population (i.e., trans-species polymorphism identical by descent). 

Given the MRCA goal, I think I can use initializeTreeSeq(retainCoalescentOnly=T) and then use treeSeqRememberIndividuals(permanent=F) in the generation prior to population splits. The first option would only keep branch points; the second option would save the most recent (closest-to-present) individuals from each subpopulation before it splits, but will get rid of individuals if neither of their genomes contribute to any extant individuals. Given the goal of categorizing shared vs. recurrent polymorphisms, I think new mutational lineages are completely independent of coalescence, i.e., a mutation's ID is sufficient to categorize it. Does that all seem correct?

A second follow-up question has to do with ending each subpopulation at different times. I was hoping to account for differences in generation times between different 'species' (i.e., isolated SLiM subpopulations) by running them for different numbers of generations. I would then record/sample each one at its appropriate time, starting with the one having the longest generation time (fewest generations of evolution). I now wonder, will this affect coalescence? For example, I am worried that if I output subpopulation p4 before I output p5, the MRCA at a particular position could change in the interim (especially if I remove p4 for efficiency). Do you have any ideas for dealing with this eventuality?

Finally, huge thanks to Peter for the explanation of using 2N past coalescence as a reasonable burn-in time — I will do that!

With gratitude,
Chase

Dr Yan Wong

unread,
Apr 6, 2021, 12:23:34 PM4/6/21
to slim-discuss, Chase W. Nelson
On 6 Apr 2021, at 17:10, Chase W. Nelson <cwnel...@gmail.com> wrote:

Dear Yan, Ben, Peter, and Miguel: 

Thank you again for this wonderful help!

No problem.

Given the MRCA goal, I think I can use initializeTreeSeq(retainCoalescentOnly=T) and then use treeSeqRememberIndividuals(permanent=F) in the generation prior to population splits.

In this particular case, it sounds like you don’t need to remember any individuals at all. Remember, “individuals” and “genomes” are separate entities. Each genome (node) has both a population and a time. So even in a tree sequence with no individuals, the tree sequence will have all the MRCA nodes with their times.

If you wanted the lat/long of the MRCAs (which is stored in the individual), you would need treeSeqRememberIndividuals.

The best thing is to try out on a small example simulation and see if the information that you need is in there.

Yan


Peter Ralph

unread,
Apr 6, 2021, 2:30:23 PM4/6/21
to Chase W. Nelson, slim-discuss
> As a follow-up, my purpose in using tree-sequence recording is to track local ancestry. My simulation splits several populations and lets them evolve separately for very long periods of time (i.e., "speciate").
> My first goal is to determine the age and population of the MRCA at each site in the genome.
> ...
> Given the MRCA goal, I think I can use initializeTreeSeq(retainCoalescentOnly=T) and then use treeSeqRememberIndividuals(permanent=F) in the generation prior to population splits. The first option would only keep branch points; the second option would save the most recent (closest-to-present) individuals from each subpopulation before it splits, but will get rid of individuals if neither of their genomes contribute to any extant individuals.

As Yan said, you probably don't need to do any Remembering for this.
doing treeSeqRememberIndividuals( ) helps you keep around Individuals,
but you likely only need the nodes (ie, genomes). It depends what you
want to know about the MRCAs - time and population are associated with
nodes (ie ancestral genomes); it's only if you want to know the
spatial location (in a continuous space model) or the age or sex or
something of those ancestors that you'd need to do any Remembering.
So, you probably just need to do initializeTreeSeq()?

> My second goal is to determine, at the end of the simulation and for a given pair of extant populations, which shared polymorphisms that are identical by state (e.g., a C>T SNP at the same site in both populations) are due to (a) recurrent mutation vs. (b) already segregating in the ancestral population (i.e., trans-species polymorphism identical by descent).
> Given the goal of categorizing shared vs. recurrent polymorphisms, I think new mutational lineages are completely independent of coalescence, i.e., a mutation's ID is sufficient to categorize it.

You're right - no need to do anything special to tell if different
alleles are identical or not. Each unique mutational origin is stored
separately.


> A second follow-up question has to do with ending each subpopulation at different times. I was hoping to account for differences in generation times between different 'species' (i.e., isolated SLiM subpopulations) by running them for different numbers of generations. I would then record/sample each one at its appropriate time, starting with the one having the longest generation time (fewest generations of evolution). I now wonder, will this affect coalescence? For example, I am worried that if I output subpopulation p4 before I output p5, the MRCA at a particular position could change in the interim (especially if I remove p4 for efficiency). Do you have any ideas for dealing with this eventuality?

I don't think there's a problem here - if you simulate longer in some
branches, then you just grow the tips of those trees longer - it won't
change the shared portion of history. Two ways you might do this:
1. when you want to end each species' branch, do
treeSeqRememberIndiivduals( ) on that population, and then get rid of
it in SLiM
2. if there's no interbreeding between the species, you can simulate
them separately and use the tree sequence operation union( ) to put
them together. Murillo Rodrigues has got some tools together to do
this, but we haven't gotten around to writing up an example.
The first option is easier, but the second one lets you run bigger simulations.

-- peter
Reply all
Reply to author
Forward
0 new messages