Calling ROH using msprime + pyslim

Arslan Zaidi

unread,

Feb 23, 2021, 2:20:53 PM2/23/21

to msprime-users

Hi all,

First of all, thank you for all your work on these amazing packages. They've really opened up a world of possibilities in popgen.

The problem: I am trying to call runs of homozygosity (ROH) and I think I'm running into issues with assigning haplotypes to the right individuals. My approach:

1. I read simulated trees from SLIM (which does track diploid individuals as opposed to msprime) using pyslim.

2. Based on pyslim's documentation, it looks like the node indices of the ith individual should be (2*i, 2*i + 1). Please tell me if this isn't correct. I confirmed this by sampling a few individuals and checking using ts.individual.nodes.

3. For each individual, I subset the tree, selecting only their haplotypes (nodes) and then selected trees which span a region of > 3Mb (I'm interested only in long runs).

4. Output the length of these trees.

This should work I think but the distribution I'm getting looks more like the distribution of inter-individual IBD instead of intra-individual IBD (ROH) - confirmed independently by running germline (which produces the correct results) on the same genomes. One way this would happen is if the nodes were not paired correctly within each individual but randomized across the sample.

I know I couldn't do this with msprime, which tracks haploid genomes rather than diploid individuals. But post-SLIM trees imported through pyslim should pair the correct hapltypes/nodes together, right?

Any insight/troubleshooting suggestions would be helpful

Thank you!

################### code ########################

import msprime
import numpy as np
import pyslim

#read tree file from slim
slim_ts = pyslim.load(args.tree).simplify()

#define function to call IBD (>3Mb) from tree given a set of nodes (haplotypes)
def callroh3(nodes):
simp_tree = slim_ts.simplify(samples = nodes)

roh3 = [tree.span for tree in simp_tree.trees() if tree.span > 3e6]
if len(roh3) > 0:
return roh3
else:
return None

#sample args.nsamples individuals
ix = np.random.randint(low = 0, high = slim_ts.num_individuals, size = args.nsamples)

#iterate over these individuals and get length of long ROH segments
longroh = [callroh3( (2*i , 2*i + 1) ) for i in ix]
longroh = [x for x in longroh if x != None]

################### code end ######################

Peter Ralph

unread,

Feb 23, 2021, 3:07:55 PM2/23/21

to Arslan Zaidi, msprime-users

> 2. Based on pyslim's documentation, it looks like the node indices of the ith individual should be (2*i, 2*i + 1). Please tell me if this isn't correct. I confirmed this by sampling a few individuals and checking using ts.individual.nodes.

This is correct as long as you aren't doing certain things in SLiM,
but it's not guaranteed, and so it's good practice not to rely on
this, and get nodes by (a) pulling out the individuals, and (b)
getting their nodes. For instance (not tested):

ix = np.random.sample(np.arange(slim_ts.num_individuals, size=args.nsamples)
ix_nodes = [ts.individual(i).nodes for i in ix]
longroh = [callroh(*n) for n in ix_nodes]

> 3. For each individual, I subset the tree, selecting only their haplotypes (nodes) and then selected trees which span a region of > 3Mb (I'm interested only in long runs).
>
> 4. Output the length of these trees.

Hm: I don't see anything wrong with your code. Conceptually this is a
bit different than the output of germline because germline is looking
for long runs of IBS; so they'll differ in that (a) there will be some
IBS that extends outside of each of the long segments of recent shared
ancestry you have found, and (b) in principle, you might have a long
segment of shared ancestry that happens to have a bunch of mutations
on it anyhow (eg because it's older than usual for that length). I
don't know if these are enough to produce the differences you are
seeing.

You may also want to look at Georgia Tsambos' methods for doing this
sort of thing:
https://tskit.readthedocs.io/en/latest/python-api.html#tskit.TableCollection.link_ancestors

Also, if you've more questions here, perhaps we could move it to a discussion:
https://github.com/tskit-dev/msprime/discussions

--peter

Arslan Zaidi

unread,

Feb 23, 2021, 3:17:17 PM2/23/21

to msprime-users

Ok good to know. Let me try explicitly calling the nodes based on individual IDs and also try Georgia's method. I will post an update either way just to resolve this.