# Questions about treeseq good practices

55 views

### hern....@gmail.com

Sep 22, 2021, 5:32:07 AM9/22/21
to slim-discuss

Hi all,

I have two questions related to treeSeq recording and want to hear your opinion in case I’m violating some rules or assumptions

1)    When doing nonWF models we often want to estimate the effective population size and figure out the N:Ne ratio. For this I normally run a long-ish simulation to achieve near equilibrium and estimate Ne=4/pi*mu. I want to do this step with treeseq instead, so I’m thinking:

a.     Calculate generation time of simulation

b.     Run a simulation with no mutations

c.     Recapitate and mutate with (mu/generation time)

d.     Calculate pi at time 0

e.     Question, should I do:

i.     Ne=4/pi*mu  or

ii.     Ne=4/pi*(mu/generation time)

Do you think this is a valid approach?

2)    My second question is about doing a bit of an unconventional series of steps to do a non-neutral nonWF simulation. Neutral mutations normally take much longer to reach equilibrium than deleterious mutations, and I want to avoid running long simulations with both mutation types. So, I have been doing

a.     Run a simulation with no mutations

b.     Recapitate and mutate with (mu/generation time)

c.     Wrote a script to take the tree at time 0 and write a file in slim output

d.     Go back to slim (without tree recording) and read the slim output with neutral mutations (in theory in equilibrium)

e.     Include deleterious mutations from this time point and wait for those to reach ~equilibrium

Is this a valid approach? I’m aware I’m losing all the genealogical information of the first run, but I’m not worried about that because I care more about what happens from step “d” onwards.

cheers,

Hernán

### Miguel de Navascués

Sep 22, 2021, 9:58:32 AM9/22/21
Hi Hernán,

Regarding the calculation of Ne, the scheme that you propose seems to be
kind of circular: which Ne are you using for the recapitation step?

An alternative way to calculate Ne from the simulations will be from the
actual reproductive events at each time step (you can make an harmonique
mean of the Ne(t) at the end). I am not very familiar with the equations
to do this calculation in the case of overlapping generations but I
think Grimm et al (2018, doi:10.1111/2041-210X.12530) could be a good
place to look for them if you think this approach suits your needs.

For your second question, why not doing step (e) with deleterious
mutation until reach some sort of equilibrium and then doing
recapitation (if necessary) and adding neutral mutations on the tree
sequence? I do not see how your more complex approach will give
different results.

In any case, I think that if you want to start from a population with
neutral mutations, it would be more efficient and safe to simulate the
whole population with msprime >=1.0.0 (using "dtwf" model) and export
the tree sequence so it can used by SLiM as an input. This would also
allow to keep the genealogical information and that might be useful.

Miguel

On 22/09/2021 11:32, hern....@gmail.com wrote:
> Hi all,
>
> I have two questions related to treeSeq recording and want to hear your
> opinion in case I’m violating some rules or assumptions
>
> 1) When doing *nonWF* models we often want to estimate the effective
> population size and figure out the N:Ne ratio. For this I normally run a
> long-ish simulation to achieve near equilibrium and estimate Ne=4/pi*mu.
> I want to do this step with treeseq instead, so I’m thinking:
>
> a. Calculate generation time of simulation
>
> b. Run a simulation with no mutations
>
> c. Recapitate and mutate with (mu/generation time)
>
> d. Calculate pi at time 0
>
> e. Question, should I do:
>
> i.     Ne=4/pi*mu  or
>
> ii.     Ne=4/pi*(mu/generation time)
>
> *Do you think this is a valid approach?*
>
> 2) My second question is about doing a bit of an unconventional series
> of steps to do a *non-neutral nonWF *simulation. Neutral mutations
> normally take much longer to reach equilibrium than deleterious
> mutations, and I want to avoid running long simulations with both
> mutation types. So, I have been doing
>
> a. Run a simulation with no mutations
>
> b. Recapitate and mutate with (mu/generation time)
>
> c. Wrote a script to take the tree at time 0 and write a file in *slim
> output*
>
> d. Go back to slim (without tree recording) and read the slim output
> with neutral mutations (in theory in equilibrium)
>
> e. Include deleterious mutations from this time point and wait for those
> to reach ~equilibrium
>
> *Is this a valid approach? *I’m aware I’m losing all the genealogical
> information of the first run, but I’m not worried about that because I
> care more about what happens from step “d” onwards.
>
> cheers,
>
> Hernán
>
> --
> SLiM forward genetic simulation: http://messerlab.org/slim/
> <http://messerlab.org/slim/>
> ---
> You received this message because you are subscribed to the Google
> Groups "slim-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send
> To view this discussion on the web visit

--
Miguel de Navascués

UMR CBGP, INRAE
Centre de Biologie pour la Gestion des Populations
755 avenue du campus Agropolis
CS30016
34988 Montferrier-sur-Lez cedex (France)

phone: +33499623370
fax: +33499623345
e-mail: miguel.navascues AT inrae.fr

### Hernan Morales

Sep 22, 2021, 11:08:46 AM9/22/21
to Miguel de Navascués, slim-discuss
Hi Miguel,

Thanks for your thoughts. All excellent points.

Re #1. Yes, I also thought calculating Ne based on the pedigree (or the genealogy in this case) was a good option. I want to simulate inbreeding later on and was worried it would distort too much, but should be fine to do it in the ancestral population. I agree with you this is the way to go. as a less circular approach, as you said. Although I think people simply use the initial population size as Ne for recapitation, right? I have never seen a huge effect of the Ne parameters on the recapitation function, I might be wrong.

Re #2. The reason is that I want to collect timepoints along the simulation while having both mutation types, so their different equilibrium points is a problem and also running treeseq recording with deleterious mutations + remembering individuals is incredibly slow and leads to huge tree files. So I rather doing it in two steps and avoid treeseq for the second step, this is why I came up with this solution. I gather you don't see a reason not to do it this way in terms of violating an assumption of the models.

cheers,

Hernán

To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/5b58115f-9db4-a735-86b9-3caa71803254%40inrae.fr.

--
Hernan

### Peter Ralph

Sep 22, 2021, 5:29:22 PM9/22/21
to hern....@gmail.com, slim-discuss
Good questions!

> 1) When doing nonWF models we often want to estimate the effective population size and figure out the N:Ne ratio. For this I normally run a long-ish simulation to achieve near equilibrium and estimate Ne=4/pi*mu. I want to do this step with treeseq instead, so I’m thinking:
> a. Calculate generation time of simulation
> b. Run a simulation with no mutations
> c. Recapitate and mutate with (mu/generation time)
> d. Calculate pi at time 0
> e. Question, should I do:
> i. Ne=4/pi*mu or
> ii. Ne=4/pi*(mu/generation time)

The answer here comes from checking what the units are on all these
quantities. In the expression for neutral diversity in a randomly
mating population,
pi = 4 Ne mu,
pi is in units of mutations per bp, 4Ne is in units of generations
(since it's twice the mean coalescence time), and mu is in units of
mutations per bp per generation. Since msprime puts down mutations at
a rate of "mutations per time unit", your step (c) with mutation rate
equal to (mu / generation time) is the right one, so that mu is in
units of mutations per generation. So, I think the right answer is
(i). But, this is always confusing - you can verify this by doing this
procedure to a neutral WF model.

> 2) My second question is about doing a bit of an unconventional series of steps to do a non-neutral nonWF simulation. Neutral mutations normally take much longer to reach equilibrium than deleterious mutations, and I want to avoid running long simulations with both mutation types. So, I have been doing
> a. Run a simulation with no mutations
> b. Recapitate and mutate with (mu/generation time)
> c. Wrote a script to take the tree at time 0 and write a file in slim output
> d. Go back to slim (without tree recording) and read the slim output with neutral mutations (in theory in equilibrium)
> e. Include deleterious mutations from this time point and wait for those to reach ~equilibrium
> Is this a valid approach? I’m aware I’m losing all the genealogical information of the first run, but I’m not worried about that because I care more about what happens from step “d” onwards.

I'm not sure what your goal is? Or, what you mean by "valid"? I guess
that the goal is to run part of your simulation without tree sequence
recording turned on? This should work fine for that, I agree, but feel
like I'm missing something.

### Hernan Morales

Sep 27, 2021, 5:51:30 PM9/27/21
to Peter Ralph, slim-discuss
Hi Peter,

Thanks for the reply. Sorry for the slow reply I wanted to test a few things to report back. Yes, what you say makes sense and indeed the right match is (i), so mutating the tree with (mu/generation time) but calculating Ne=4/pi*mu.

What I tried was
(1) running the simulation with slim - result: Ne/N = 0.6076236
(2) with treeSeq without mutations and mutating in pyslim - result:  Ne/N = 0.5964987 (note that recapitate does not ask for Ne parameter anymore)
(3) recording 100 steps with treeSeq and asking how many parent nodes per step there are in average  - result:  Ne/N = 0.5247862
(4) recording the pedigree for 100 steps and using optiSel in R to get Ne - result:  Ne/N = 0.6283925

So all results are sort of similar, I think solution #3 underestimates Ne because some steps come back with very few parents (effect of overlapping generations?), at least with my code, that might be wrong?:

import msprime, pyslim
import numpy as np
from collections import Counter
TREE="pedigree_seed444_gen200_genTime.tree"
nodes=ts.tables.nodes
parent_nodes = ts.tables.edges.parent
parentTime=nodes[parent_nodes].time
freq = Counter(parentTime)
Ne=np.array(list(freq.values())).mean()
N=len(ts.individuals_alive_at(0))
Ne/N

Anyways, my question was about how appropriate was to use solution #2 and it seems that its good!

cheers,

Hernán
--
Hernan

### Peter Ralph

Sep 27, 2021, 6:37:54 PM9/27/21
to Hernan Morales, slim-discuss
Hi, Hernán - thanks for reporting back! A few notes:

> (1) running the simulation with slim - result: Ne/N = 0.6076236
> (2) with treeSeq without mutations and mutating in pyslim - result: Ne/N = 0.5964987 (note that recapitate does not ask for Ne parameter anymore)

recapitate *does* take the Ne parameter; see the second paragraph at

> (3) recording 100 steps with treeSeq and asking how many parent nodes per step there are in average - result: Ne/N = 0.5247862

Counting number of parents doesn't actually get you an estimate of Ne.
(It should be in the right ballpark, though.) Really, what you want to
do is to find out how many children each individual had, and if the
vector of these number is C, compute
Ne = sum(C * (C-1))
(note that what you're computing is sum(C > 0).) To figure this out,
equate 1/2Ne to the probability that if you pick two genomes uniformly
at random that they inherit from the same parental genome in the
previous generation.

> So all results are sort of similar, I think solution #3 underestimates Ne because some steps come back with very few parents (effect of overlapping generations?), at least with my code, that might be wrong?:

> parent_nodes = ts.tables.edges.parent
> parentTime=nodes[parent_nodes].time

Hm, I see that in your code you're computing the number of parents
*born* at each time, which is different than the number alive at each
time. (And, I'm assuming you remembered everyone in every generation;
otherwise you're missing a bunch.) Oh wait - you're counting the times
of birth of each parent, once for each distinct bit of genome they
passed on. I'm not sure what that will do.

- peter

### Chrystelle Delord

Dec 8, 2021, 8:59:38 AM12/8/21
to slim-discuss
Dear Hernán and SLiM users,

I was wondering how, Hernán, what you ended up choosing to compute Ne in SLiM, and which Ne value you ended up using during the recapitation phase in pyslim/msprime?

I am currently dealing with similar issues and for now, I am using pedigree recording to get the demographic Ne from my nonWF population in SLiM.
(I am modelling a population that is in accordance with the Hill-Felsenstein model presented by Waples et al. (2011, doi: 10.1890/10-1796.1), so I could get the expected demographic Ne with the AgeNe software, the observed simulated Ne with Equation (2) from the Waples paper, and check the right concordance between the two values in order to be sure I am simulating everything the way I want to.)

However, I am trying to assess how accurate this would be to use that pedigree-based Ne to call pyslim/msprime for the recapitation phase.
My current thought is that it could be accurate under quite specific conditions (constant population size, low inter-individual variance in reproductive success, absence of strong subpopulation structure), otherwise I am not even sure that using the harmonic Ne across SLiMulated generations would be OK (maybe I am wrong?).
In your case, you said that you wanted to simulated inbreeding at some point: so, did you ended up using the pedigree-based Ne before inbreeding occurs to feed the recapitation model?

Another option I was thinking about is to choose recapitation-Ne myself (depending on the amont of diversity I want to generate for the ancestral msprime population) and then keep a "small burn-in" (e.g. ~2Ne) during the SLiM phase, Ne being the expected, pedigree-based Ne this time.

So, I would be very curious to know what you ended up doing in your case?
Thanks a lot :)

Chrys