Serial sampling and tree sequences

Tom McMahon

Oct 30, 2020, 11:37:21 AM

to slim-discuss

Hi there,

I am attempting to generate a nonWF simulation of a haploid population (with no recombination). I want to be able to sample the simulation at various timepoints and obtain the true, dated genealogy of the sampled individuals. I am using the tree-sequence output as you use in your recent paper adapting SLiM to haploid populations. However, the treeSeq output at the end of the simulation contains only the individuals remaining in the final generation and so I am unable to sample from earlier points. I could generate treeSeq outputs at various points through the simulation but I'm not sure how I could then take them and generate a single tree of samples from different timepoints. Any advice would be much appreciated.

Thanks,

Tom McMahon

MPhil Student, University of Cambridge

Yan Wong

Oct 30, 2020, 11:44:10 AM

to Tom McMahon, slim-discuss

You want treeSeqRememberIndividuals() I suspect.

Yan

Ben Haller

Oct 30, 2020, 11:51:41 AM

to slim-discuss

Indeed. Section 17.5 of the SLiM manual has an example of its use, with discussion. For your use case you will want to call treeSeqRememberIndividuals() at multiple time points, but it should work similarly otherwise.

Cheers,

-B.

Benjamin C. Haller

Messer Lab

Cornell University

Tom McMahon

Oct 30, 2020, 12:25:48 PM

to slim-discuss

that's perfect, thanks!

Chrystelle Delord

Sep 29, 2021, 9:24:39 AM

to slim-discuss

Dear all,

Please allow me to "reopen" that thread :)

I am also very interested in that issue of serial sampling, and this was very helpful (for anyone interested, the pyslim vignette is also very enlightening, see section "Historical individuals", https://tskit.dev/pyslim/docs/latest/tutorial.html#historical-individuals), thank you!

I just had a few follow-up questions (sorry if these are already covered in the Manual, I am still discovering it!):

- Say, I would like to draw a random sample of individuals every five generations from timepoint 75 to 100 during a forward SLiMulation of a nonWF model (and export some demographic information like age, sex, parents ID, etc).

(1) Could I also use the sampled individual metadata (i.e., their IDs) to pass to the treeSeqRememberIndividuals(..., permanent=T) option, within the same callback, to make sure these individuals will be kept in the final exported .trees file?

(i.e., the treeSeqRememberIndividuals(..., permanent=T) will also be called every five generations from timepoint 75 to 100 during the forward SLiM phase: is that OK?)

- Now, I would like to get the genomic information of the individuals I have randomly sampled in phase (1). For that, I want to use the recapitation>simplification>mutation process.

I suppose it is possible to use the nodes ID from each individual ever "remembered" (which should be available from the individual metadata as well, providing phase (1) went OK) to perform the SlimTreeSequence.simplify().

(2) But, if I want to extract those individual genotypes after simplification, is that enough to ensure that all their nodes will still be kept as 'samples', and I will be able to get the full genomic information for everyone in my vcf file?

- As the model is nonWF, I may have sampled the same individual several times across various timepoints.

(3) Could this represent a particular issue while calling the treeSeqRememberIndividuals() option?

Thank you very much for your feedback, and have a very nice day!

Chrys

Peter Ralph

Sep 29, 2021, 11:45:52 AM

to Chrystelle Delord, slim-discuss

> I am also very interested in that issue of serial sampling, and this was very helpful (for anyone interested, the pyslim vignette is also very enlightening, see section "Historical individuals", https://tskit.dev/pyslim/docs/latest/tutorial.html#historical-individuals), thank you!

Thanks!

> - Say, I would like to draw a random sample of individuals every five generations from timepoint 75 to 100 during a forward SLiMulation of a nonWF model (and export some demographic information like age, sex, parents ID, etc).
>
> (1) Could I also use the sampled individual metadata (i.e., their IDs) to pass to the treeSeqRememberIndividuals(..., permanent=T) option, within the same callback, to make sure these individuals will be kept in the final exported .trees file?
> (i.e., the treeSeqRememberIndividuals(..., permanent=T) will also be called every five generations from timepoint 75 to 100 during the forward SLiM phase: is that OK?)

I'm not sure if this is what you're asking, but: you only have to
Remember individuals once to make sure they'll be in the output - once
remembered, always remembered.

> - Now, I would like to get the genomic information of the individuals I have randomly sampled in phase (1). For that, I want to use the recapitation>simplification>mutation process.
> I suppose it is possible to use the nodes ID from each individual ever "remembered" (which should be available from the individual metadata as well, providing phase (1) went OK) to perform the SlimTreeSequence.simplify().
>
> (2) But, if I want to extract those individual genotypes after simplification, is that enough to ensure that all their nodes will still be kept as 'samples', and I will be able to get the full genomic information for everyone in my vcf file?

Yes - everyone that you Remember (permanently; and note that
permanent=T is the default) is guaranteed to have their entire genome
represented in the output.

> - As the model is nonWF, I may have sampled the same individual several times across various timepoints.
> (3) Could this represent a particular issue while calling the treeSeqRememberIndividuals() option?

Yes, good thinking - what happens is that if you Remember them more
than once, their information (e.g. age, spatial location) will be
updated. Their genome won't change, of course, but if there's
something else that might change across multiple sampling events
you'll have to save that in some other way (e.g. put it in top-level
metadata). Suppose that there's an individual remembered both 10 and
15 generations before the end of the simulation; then when you do
ts.individuals_alive_at(10)
and
ts.individuals_alive_at(15)
they'll appear in both lists. Note that you'll want to extract only
those individuals whose nodes are *samples*; we ought to make this
easier somehow (e.g. https://github.com/tskit-dev/pyslim/issues/203 -
suggestions welcome).

Hope this helps!
-Peter
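
A minimal sketch of the samples-only filter described above, offered as an illustration and not as part of pyslim. The helper names `individuals_with_sample_nodes` and `node_ids_of` are hypothetical; the only calls assumed on `ts` are tskit's `ts.individual(i).nodes` and `ts.node(n).is_sample()`.

```python
def individuals_with_sample_nodes(ts, individual_ids):
    """Return the IDs from individual_ids whose nodes are all samples.

    `ts` is assumed to behave like a tskit TreeSequence:
    ts.individual(i).nodes lists an individual's node IDs and
    ts.node(n).is_sample() reports whether a node is flagged as a sample.
    """
    kept = []
    for ind_id in individual_ids:
        nodes = ts.individual(int(ind_id)).nodes
        # Individuals kept only as ancestors may have non-sample nodes,
        # or no nodes at all; skip them.
        if len(nodes) > 0 and all(ts.node(int(n)).is_sample() for n in nodes):
            kept.append(int(ind_id))
    return kept


def node_ids_of(ts, individual_ids):
    """Flatten the node IDs of the given individuals, e.g. to pass to ts.simplify()."""
    return [
        int(n)
        for ind_id in individual_ids
        for n in ts.individual(int(ind_id)).nodes
    ]
```

With the pyslim version used in this thread, this could be called as `individuals_with_sample_nodes(ts, ts.individuals_alive_at(10))`, and the result passed through `node_ids_of` to get the node list for simplification.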

Chrystelle Delord

unread,

Sep 30, 2021, 2:51:58 AM9/30/21

to slim-discuss

Hi Peter,

Thank you very much, that helps a lot!

I figured out I had been missing a few things: I believed that it was completely impossible to sample individuals from >0 time units ago without remembering them. Actually (if I am not mistaken again!) it is always possible to access individuals that were alive x time units ago with ts.individuals_alive_at(x), BUT we may not be able to get the full genomic information for all of them (many of which will be considered as non-samples), so this is when remembering them is useful.

Also, if I am not wrong, there might be two causes for an individual to be 'non-sample' in the final recapitated, simplified, mutated tree sequence:

(a) if the individuals retained after simplification and the individuals remembered do not fully overlap (so we need to simplify with, as input, the node IDs of every individual ever remembered during forward simulation).

(b) because some alive individuals with a non-sample node may still be retained after simplification, so no matter what, it is mandatory to check that both of an individual's nodes are samples before trying to output their genotypes, or to use the suggested option ts.individuals_alive_at(x, samples_only=T) if available.

I suppose it would be possible to circumvent that feature when extracting genotypes, e.g. by building the indivlist = [ ] of ts.write_vcf(vcffile, individuals=indivlist) with another option than ts.individuals_alive_at(x, samples_only=T)? Especially in a case of serial sampling, where I would like to get the genotype of every individual ever remembered even if not alive at time x?

I was thinking, maybe, of saving a text file of all remembered individual IDs at each sampling instance in SLiM, and binding these IDs into a single list ( indivlist = [ ] ) to pass as argument in ts.write_vcf(vcffile, individuals=indivlist) in pyslim.

Again, thank you so much for your quick and thoughtful answer,

I am having the best time discovering SLiM/pyslim!

Warm greetings,

Chrys
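
The text-file idea in the post above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested pyslim recipe: the helper names are hypothetical, the IDs saved from SLiM are assumed to be pedigree IDs, and the tree sequence is assumed to expose them as `ind.metadata["pedigree_id"]` (which requires the individual metadata to be decoded by the SLiM schema).

```python
def merge_sampled_ids(id_lists):
    """Combine per-timepoint ID lists into one list without duplicates.

    An individual sampled at several timepoints (possible in a nonWF
    model) appears only once; order of first appearance is preserved.
    """
    seen = set()
    merged = []
    for ids in id_lists:
        for pid in ids:
            if pid not in seen:
                seen.add(pid)
                merged.append(pid)
    return merged


def tskit_ids_for_pedigree_ids(ts, pedigree_ids):
    """Map SLiM pedigree IDs to tskit individual IDs.

    Assumes each individual's decoded metadata is a dict with a
    "pedigree_id" entry; IDs absent from the tree sequence are ignored.
    """
    wanted = set(pedigree_ids)
    return [
        ind.id
        for ind in ts.individuals()
        if ind.metadata["pedigree_id"] in wanted
    ]
```

The resulting list could then serve as `indivlist` in `ts.write_vcf(vcffile, individuals=indivlist)`, where `vcffile` is an open file handle. It is probably still worth checking that each listed individual's nodes are samples before writing.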
