Missing Data in tskit overlay


Megan Smith

Jul 8, 2020, 2:21:15 PM
to slim-discuss
Hi,

I am trying to simulate a scenario of background selection in SLiM and then overlay neutral mutations using pyslim and msprime. Things seem to be working fine, except that, in about 5% of replicates, there will be missing data in the genotype matrix. I'm not sure why this would be the case; I am simplifying the tree sequence to only contain 10 alive individuals per population before overlaying neutral mutations. I am attaching my .slim script, the python script, and an example of an output that contains a missing data value.
 
Any advice you have would be appreciated! Thanks in advance.

Best,
Megan
nomig_deleterious_scaled_1_overlaid_ms.out
nomig_deleterious_scaled.slim
neutral_tskit_to_msout_topost.py

Peter Ralph

Jul 8, 2020, 6:11:37 PM
to Megan Smith, slim-discuss
Hi, Megan!

Gee, that's a problem - there should not be missing data. But, I'm not
able to reproduce the problem. I've just run this:
    for k in $(seq 40); do
        python3 neutral_tskit_to_msout_topost.py yyy nomig_deleterious_scaled.slim xxx $k
    done
and not gotten any errors. Maybe I'm not running it correctly? Your
code is very nice and tidy, btw, and I didn't see any errors, but also
as far as I can tell the `xxx` and `yyy` aren't doing anything, so...
maybe I did something wrong?

It would be most helpful if you can send the random seed (which I see
you are recording: excellent) from the SLiM run that produces the
error.

Also: what version of pyslim do you have? And, tskit? If they are not
up-to-date there may be a problem there (but nothing is springing to
mind).

thanks,
peter

Megan Smith

Jul 9, 2020, 8:38:29 AM
to Peter Ralph, slim-discuss
Hi Peter,

Thanks for the quick response! 

The xxx and yyy don't do anything currently; that's correct. They are left over from my attempt to use flags for the parameters so they could be given in any order, which failed (right now the arguments are read by position). I just haven't removed them because I was planning to revisit this. But the way you ran the code should work fine.

The random seed that produced the example that I sent was 
Final random seed: 1809121419536.

My versions are:
pyslim = 0.401
tskit = 0.2.3

Python 3.7.6

SLiM version 3.4


Thanks again!


Best,

Megan


Megan L. Smith, PhD
Postdoctoral Researcher
Department of Biology
Indiana University


Peter Ralph

Jul 9, 2020, 11:21:04 AM
to Megan Smith, slim-discuss
A follow-up for the list, after some offline discussion: it seems that
Megan has turned up a bug, apparently in simplify:
https://github.com/tskit-dev/tskit/issues/714
... which is very surprising, given the amount of use it has got.
We're working on tracking it down.

-- peter


Peter Ralph

Jul 9, 2020, 12:53:24 PM
to Megan Smith, slim-discuss
Follow-up to the follow-up: this was NOT a bug, after all, just a
conceptually very tricky thing to spot.

The issue is that you've not recapitated, and so there are (rare)
places on the genome where not all lineages have coalesced. Since
you're looking at genotypes, mostly this just shows up as having a bit
less diversity than you would otherwise (since any mutations
segregating in the initial population -- i.e., that would have
occurred on the missing ancestral history -- are missing). However,
this node at this position was in the unlikely situation of being all
by itself - it inherited from a node in the first generation that
no-one else did. When you simplified, that node's ancestor got
removed, and so it was a node with no parent at that position. And,
that happens to be how tskit represents missing data.

So: the solution is to recapitate first!

-- peter