Changing stacking policy in msprime.SLiMMutationModel(type=1)

Nathan Anderson

Mar 11, 2022, 11:56:32 AM

to slim-discuss

Hi all

I am attempting to begin a SLiM simulation with standing variation generated in msprime.

I have been following along with the vignette here; however, I would like my simulation to use stacking policy "f". I am wondering whether there is a built-in way of doing this.

Otherwise, I have written a workaround in which I remove rows from the mutation table if they occur at an already-occupied site. Please let me know if there is a better way to go about this.

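# ots is the tree sequence with the overlaid mutations
tables = ots.dump_tables()  # editable copy of the tables
tables.mutations.clear()
seen_sites = set()
for mut in ots.tables.mutations:
    # keep only the first mutation listed at each site
    if mut.site not in seen_sites:
        tables.mutations.append(mut)
        seen_sites.add(mut.site)

ots = tables.tree_sequence()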

Thanks!

Nathan

Ben Haller

Mar 11, 2022, 12:36:09 PM

to slim-discuss

Hi Nathan! This question is one for Peter, I imagine. From my end, I just have a question: *why* do you want a stacking policy of "f" for your msprime burn-in? That's unusual, and I wonder whether understanding why you want to do that might lead to suggestions for a better way to approach the problem you're trying to model. (Also, I'm just curious :->).

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University

Nathan Anderson

Mar 11, 2022, 3:35:05 PM

to slim-discuss

Hi Ben

My goal is to benchmark different tests for selection based on allele frequency change vectors (e.g., the CMH test) under different demographic histories. The analysis will be similar to studies such as this.

The tests I am interested in assume biallelic SNPs, so I do not think stacking mutations would be appropriate, because of this model's similarity to the infinite alleles model.

Furthermore, some of these methods attempt to infer selection coefficients from changes in allele frequencies. I believe that having two mutations stacked at the same locus would interfere with this inference.

For example, let's say that we have a mutation A, and a mutation B is later stacked on top of it, both with selection coefficient s, and that we are interested in inferring mutation A's s value from its allele frequency trajectory. This inference would be biased upward because allele A is often, but not always (?), stacked with B.
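To make the direction of that bias concrete, here is a minimal sketch under assumptions that are not from the thread: a deterministic haploid model with no drift, multiplicative fitness effects (so an A+B genome has fitness (1 + s)^2), and a naive estimator that reads s off the change in the log-odds of carrying A. The function names and starting frequencies are hypothetical.

```python
import math

def carrier_frequency_trajectory(x_wild, x_a, x_ab, s, generations):
    """Frequency of genomes carrying A (alone or stacked with B) per generation.

    Deterministic haploid selection with multiplicative fitness effects.
    """
    w_wild, w_a, w_ab = 1.0, 1.0 + s, (1.0 + s) ** 2
    freqs = [x_a + x_ab]
    for _ in range(generations):
        mean_w = x_wild * w_wild + x_a * w_a + x_ab * w_ab
        x_wild = x_wild * w_wild / mean_w
        x_a = x_a * w_a / mean_w
        x_ab = x_ab * w_ab / mean_w
        freqs.append(x_a + x_ab)
    return freqs

def naive_s_estimate(freqs):
    """Estimate s assuming the allele's odds grow by a factor (1 + s) per generation."""
    def logit(p):
        return math.log(p / (1.0 - p))
    generations = len(freqs) - 1
    return math.exp((logit(freqs[-1]) - logit(freqs[0])) / generations) - 1.0

TRUE_S = 0.05
# A never stacked with B: the naive estimate recovers s.
s_hat_alone = naive_s_estimate(
    carrier_frequency_trajectory(0.90, 0.10, 0.00, TRUE_S, 20))
# Some A-bearing genomes also carry B: the estimate for A is inflated.
s_hat_stacked = naive_s_estimate(
    carrier_frequency_trajectory(0.90, 0.07, 0.03, TRUE_S, 20))
print(s_hat_alone, s_hat_stacked)
```

In this toy model the estimate is exact when A is never stacked and exceeds s as soon as any A-bearing genomes also carry B, which matches the intuition above.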

As I am typing this, I realize you could have a very similar scenario with two beneficial mutations lying right next to each other in the genome and interfering with one another. I suppose I am just having trouble wrapping my head around what effect having multiple alleles segregating at a locus will have on tests based on allele frequency changes at biallelic SNPs.

Let me know your thoughts!

Thanks for your quick response!

Nathan

Ben Haller

Mar 11, 2022, 3:59:41 PM

to slim-discuss

Hi Nathan,

Agreed that you don't want stacked mutations, yes. I'm not sure whether mutation overlay in Python with msprime always/never/sometimes gives stacked mutations, or how that is controlled; that's another question for Peter. :->

I guess you want a stacking policy of "f" in an attempt to force all SNPs to be biallelic, but that won't actually achieve that goal, I don't think, will it? If a mutation A occurs at a given site in a given genome, the policy of "f" would prevent genomes containing A from mutating again, to B, say, or to B stacked over A. But it would NOT prevent other genomes that do not contain A from mutating to B, or to C, etc. For this reason, it does not prevent multiallelic SNPs from existing, it ONLY prevents stacking. If you want to force all SNPs to be biallelic, you will need to jump through some other hoop; I don't know whether the right approach to that problem would be to use a custom mutational model of some kind in Python, or to post-filter the mutations that get overlaid (removing all but one mutation at any given position), or what.

So stacking policy "f" only prevents stacking, it does not prevent SNPs from becoming multiallelic. Furthermore, once that fact is established, policy "f" is also a rather strange policy that is rarely biologically realistic, because it prevents stacking from occurring by keeping the FIRST mutation to occur at a given position in a given genome – it blocks any new mutations from changing an already-mutated site (for no obvious biological reason). When the goal is to prevent stacking, policy "l" is almost always what is wanted, since it prevents stacking from occurring by keeping the LAST mutation to occur at a given position in a given genome – it replaces the old mutation at a site with the new mutation (which makes much more sense, biologically, and indeed is how nucleotide-based models in SLiM always work). I realize that you want policy "f" in order to prevent new mutations from making SNPs multiallelic, but as I explained above, it doesn't achieve that goal anyway. Given that it doesn't achieve that goal anyway, policy "l" is almost always what one wants, if one wants to prevent stacking.
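A toy sketch of the three policies may help other readers. It is not SLiM or msprime code: genomes are plain dicts mapping a position to the list of mutation IDs held there, and `add_mutation`, `alleles_at`, and `run` are hypothetical helpers written only for this illustration. The same three mutation events are applied under each policy, with a third genome left unmutated.

```python
VALID_POLICIES = {"s", "f", "l"}

def add_mutation(genome, position, mutation_id, policy):
    """Add a mutation to one genome (dict: position -> list of mutation IDs)."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown stacking policy: {policy!r}")
    existing = genome.get(position, [])
    if not existing or policy == "s":
        # first mutation at this position, or policy "s": stack
        genome[position] = existing + [mutation_id]
    elif policy == "l":
        # keep last: the new mutation replaces the old one
        genome[position] = [mutation_id]
    # policy "f": keep first, so the new mutation is simply dropped

def alleles_at(genomes, position):
    """Distinct allelic states at a position; () is the ancestral state."""
    return {tuple(g.get(position, [])) for g in genomes}

def run(policy):
    g1, g2, g3 = {}, {}, {}
    add_mutation(g1, 10, "A", policy)  # A arises in genome 1
    add_mutation(g1, 10, "B", policy)  # B hits the same site in genome 1
    add_mutation(g2, 10, "C", policy)  # C arises in a genome without A
    return alleles_at([g1, g2, g3], 10)

alleles_by_policy = {policy: run(policy) for policy in ("s", "f", "l")}
for policy, alleles in alleles_by_policy.items():
    print(policy, sorted(alleles))
```

Under all three policies the site ends up with three allelic states. Policies "f" and "l" differ only in whether genome 1 keeps A or B; neither one makes the site biallelic.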

Apologies if you already understood all of this. :-> I'm just trying to clarify the terminology and what different options actually achieve, for other readers as much as for you. :-> Anyhow, the short answer is that I don't think stacking policy is the right knob to control what you want to control, and Peter will be able to give you guidance regarding what the correct knob might be. :-> And apologies, also, if I have misunderstood what you wrote and am totally off-base!

Cheers,
-B.


Nathan Anderson

Mar 11, 2022, 5:14:01 PM

to slim-discuss

Hi Ben

Thank you for your detailed response. I had not fully thought through the implications of keeping the first mutation to appear canonically in this scenario.

I suppose the easiest solution would be to choose the target of selection from among the biallelic loci.
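As a rough sketch of what I mean, assuming the site index of every mutation is available as a flat list (like the site column of a mutation table), the candidates are the sites hit by exactly one mutation. The list below is made-up data and the seed is arbitrary.

```python
import random
from collections import Counter

# Hypothetical site index for each mutation, one entry per mutation.
mutation_sites = [0, 1, 1, 2, 3, 3, 3, 4]

# A site with exactly one mutation has one ancestral and one derived state.
mutations_per_site = Counter(mutation_sites)
biallelic_sites = sorted(
    site for site, count in mutations_per_site.items() if count == 1)

rng = random.Random(42)  # seeded so the choice is reproducible
target_site = rng.choice(biallelic_sites)
print(biallelic_sites, target_site)
```

Here sites 0, 2, and 4 are the candidates, and the target of selection is drawn from among them.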

I'll reach out to Peter for his thoughts!

Thank you for your time!

Nathan

Miguel de Navascués

Mar 12, 2022, 5:27:14 AM

to slim-d...@googlegroups.com

Hi Nathan,

Another way of looking at this: non-biallelic SNPs occur in real data and
are filtered from the data if the analysis requires it. You could just
filter them from your simulated data and, in some way, your simulations
will be closer to reality. You might lose some targets of selection,
but that could be happening in real data too.
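
A minimal sketch of such a filter, assuming the simulated genotypes have been exported as one row per site with integer allele codes (0 for the ancestral state); the function name and the matrix are hypothetical:

```python
def keep_biallelic(genotypes):
    """Keep only the sites where exactly two distinct alleles are observed."""
    return [row for row in genotypes if len(set(row)) == 2]

# One row per site, one column per sampled genome (made-up data).
genotypes = [
    [0, 0, 1, 1],  # biallelic: kept
    [0, 1, 2, 0],  # three alleles: dropped
    [0, 0, 0, 0],  # monomorphic: dropped
    [1, 0, 0, 0],  # biallelic: kept
]
filtered = keep_biallelic(genotypes)
print(filtered)
```

This is the same kind of filter usually applied to empirical SNP calls, so the simulated and real data go through matching steps.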

Best,

Miguel

--
Miguel de Navascués

UMR CBGP, INRAE
Centre de Biologie pour la Gestion des Populations
755 avenue du campus Agropolis
CS30016
34988 Montferrier-sur-Lez cedex (France)

phone: +33499623370
fax: +33499623345
e-mail: miguel.navascues AT inrae.fr

Peter Ralph

Mar 14, 2022, 12:27:08 AM

to slim-d...@googlegroups.com

My input is that although it is certainly possible to do what you
want, it is rather fiddly, and Miguel has the correct answer, in my
view:

> Miguel de Navascués wrote:
> Another way of looking at this: non-biallelic SNPs occur in real data and
> are filtered from the data if the analysis requires it. You could just
> filter them from your simulated data and, in some way, your simulations
> will be closer to reality. You might lose some targets of selection,
> but that could be happening in real data too.

Also, remember that the proportion of sites at which such things
happen will be small (if you are simulating with realistic
parameters).
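
For a rough sense of scale, here is a back-of-the-envelope sketch. It assumes a constant-size neutral coalescent, plugs the expected total tree length (4N times the harmonic sum, in generations) into a Poisson count of mutations per site, and uses purely illustrative parameter values; none of these numbers come from the thread.

```python
import math

def multi_hit_fraction(n_samples, diploid_ne, mu_per_site):
    """Approximate fraction of mutated sites that carry two or more mutations."""
    # Expected total branch length for n sampled genomes, in generations.
    harmonic = sum(1.0 / i for i in range(1, n_samples))
    expected_tree_length = 4.0 * diploid_ne * harmonic
    # Expected number of mutations at a single site on that tree.
    lam = mu_per_site * expected_tree_length
    p_at_least_one = -math.expm1(-lam)
    p_at_least_two = p_at_least_one - lam * math.exp(-lam)
    return p_at_least_two / p_at_least_one

fraction = multi_hit_fraction(n_samples=100, diploid_ne=10_000, mu_per_site=1e-8)
print(f"{fraction:.4%}")
```

With these values the fraction comes out at roughly 0.1% of mutated sites. The result scales with both the mutation rate and the population size, so larger values of either will raise it.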

Nathan Anderson

Mar 21, 2022, 3:49:35 PM

to slim-discuss

Hi all

Thank you for your suggestions. You're correct that the pipeline I've taken with empirical data includes filtering for biallelic sites. And all of my test runs with msprime have resulted in a small proportion of multiallelic sites. I agree filtering these sites is the best course of action for my simulated data as well.