Tracking IBD boundaries with and without full ARG

66 views
Skip to first unread message

pramesh shakya

unread,
Sep 28, 2021, 9:55:43 PM9/28/21
to msprime-users
Hello msprime community,

I'm a new user to msprime and was playing around with how the tree sequences differed with and without --record_full_arg option, specifically w.r.t to IBD boundaries. Since the number of trees change reduce when disabling the option and this affects the intervals of the trees as well which in turn creates disparity in the IBD boundaries found for some sample pairs using the two different trees. I was wondering how can this be made consistent ( I'm guessing map_nodes  can help with but not entirely sure) or if I'm missing something.

Also, the 'extra' nodes in the tree produced with the --record_full_arg option are still recombined nodes even though they don't have the NODE_IS_RE_EVENT flag right ?
I apologize in advance if these are naive questions.

Best,
Pramesh

Peter Ralph

unread,
Sep 29, 2021, 11:36:18 AM9/29/21
to pramesh shakya, msprime-users
Hi, Pramesh -

I'm probably missing something, but I'm not sure that full_arg does
affect IBD segments? IBD is called not based on tree boundaries but
based on ancestral node boundaries; doing full_arg does not actually
change the MRCA nodes, it only adds some extra edges and nodes that
don't affect coalescences; and it's coalescences that determine IBD.

A simple way to check is whether calling simplify() on the tree
sequence recorded with full_arg and then finding IBD will give you the
same IBD. If doing full_arg *does* affect IBD, then it should still be
the case that the two operations: (1) simulate and (2)
simulate(full_arg=True) -> simplify are equivalent. So, you can verify
by doing a simplify on a full_arg tree to see if it changes IBD.

I'm not sure what you mean by the second question? That would be a
good one for our discussions
(https://github.com/tskit-dev/msprime/discussions) - want to post it
there, with more details? Same for the first question, if you want to
discuss more.

- Peter

- peter
> --
> You received this message because you are subscribed to the Google Groups "msprime-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to msprime-user...@googlegroups.com.
> To view this discussion on the web, visit https://groups.google.com/d/msgid/msprime-users/6b5964ed-5405-413b-bdda-2c6f0215bc79n%40googlegroups.com.

pramesh shakya

unread,
Oct 1, 2021, 5:08:26 PM10/1/21
to msprime-users
Hi Peter,

Thank you for the quick reply. 

1) So for the first case of IBD boundaries, here's the code I ran with the --record_full_arg option and then simplified :

num_samples = 4
ts = msprime.sim_ancestry(num_samples, sequence_length=10,
                          recombination_rate=0.1, random_seed=3,
                          record_full_arg=True, ploidy=2)
version=1
samples = ts.samples()
ts_simplified = ts.simplify(samples=samples)
ts_simplified.dump("data_dump-{}-simplified.ts".format(version))
with open("pic-{}-simplified.txt".format(version), 'w') as out:
    out.write(ts.draw_text())


and here's the tree that I get:
tree-simplified.JPG

Now, when I run the simulation without the full ARG option, I get the following tree sequences:
tree-noARG.JPG

Now, if you look at the IBDs for the sample pair (0, 4) in the two trees, they're not consistent because of the tree sequences boundaries as well as additional trees(in the --record_full_arg case).

IBDs in the version with ARG recorded(first tree) between samples 0 and 4:
Sample1  Sample2  StartPos  EndPos  MRCA
       0              4                 0             3.0        32      
       0              4                3.0           5.0        27     
       0              4                5.0           6.0        24     
       0              4                6.0           7.0        31  
       0              4                7.0          10.0        24    

Similary, IBDs in the version without ARG (second tree) between samples 0 and 4:
Sample1  Sample2  StartPos  EndPos  MRCA
   0                 4                 0             2.0        18     
   0                 4              2.0            5.0         16 
   0                 4              5.0          10.0        19 

So as per your advice, I suppose simplifying the trees still didn't work. 

2) As for the question regarding the 'extra' nodes. I was talking about the additional nodes in the first tree. As per the tutorial on ARG and tree sequences, when examining the nodes table, the extra nodes with NODE_IS_RE_EVENT flag on are recombination events. I also followed through github issue on detectable recombination events but I guess I still don't quite understand what those nodes are without the NODE_IS_RE_EVENT flag. For example, node 15 and 16 in the first tree look like they share parents and are a result of recombination, but none of the nodes in the node table had the NODE_IS_RE_EVENT turned on

I'll also take your advice and start up the discussion on the github page.
Thank you.

-Pramesh

pramesh shakya

unread,
Oct 4, 2021, 4:58:45 PM10/4/21
to msprime-users
Hi Peter,

So I just noticed that I was using the un-simplified tree sequence to output the picture of the tree sequences, so below is the correct picture of the tree sequences after the ARG recorded tree has been simplified.tree-simplified.JPG
As you can see this is still inconsistent with the tree sequences output without the --record_full_arg flag. Using this simplified tree, the IBDs shared between samples 0 and 4 would be :

Sample1  Sample2  StartPos  EndPos  MRCA
      0               4               0              3             18
      0               4               3              5             16
      0               4               5              6             15
      0               4               6              7             17
      0               4               7              10           15
Reply all
Reply to author
Forward
0 new messages