simulating gene trees and sequences within a given species tree

146 views
Skip to first unread message

Jake

unread,
Jun 16, 2011, 3:27:00 PM6/16/11
to DendroPy Users
Is it possible to simulate gene trees, and then DNA sequences on those
gene trees, given a fixed species tree with branch lengths? Tree is
in newick format.

Thanks!
Jake

Jeet Sukumaran

unread,
Jun 16, 2011, 3:42:16 PM6/16/11
to DendroPy Users
Hi Jake,

Basically, this is done as follows:

#! /usr/bin/env python

import dendropy
from dendropy import treesim

# read the species tree
sp_tree = dendropy.Tree.get_from_string("[&R] (((A:2, B:2):1,(C:1,
D:1):2):1,E:4);",
"newick")

# set the number of individuals sampled from each species
for leaf in sp_tree.leaf_iter():
leaf.num_genes = 5 # can be any number you want

# if your branch lengths are NOT in coalescent units
# and are given in, e.g. generations, you will need
# to set the population sizes of the edges
#for edge in sp_tree.postorder_edge_iter():
# edge.pop_size = 1.0 # 1.0 => branch lengths in coalescent
units

# Simulate a gene tree within the species tree.
# `gene_tree` will be the constrained/censored/truncated gene tree
# `mapped_sp_tree` is a *clone* of the original input species
tree,
# but with the gene tree nodes as attributes of its
# nodes, so you can see where on the species tree they coalesce
etc.
# You can ignore it if you are not using this information.
gene_tree, mapped_sp_tree = treesim.constrained_kingman(sp_tree)

# show it!
print(gene_tree.as_string("newick"))
print(gene_tree.as_ascii_plot())

You do not mention what units your branch lengths are in. If
unspecified, then the population is taken as "1.0", which means that a
branch length of "1.0" corresponds to 1 (haploid) coalescent unit. If
this is not the case, you will have to either set the population size
on each edge correctly, as shown in the commented-out code above, or
adjust your branch lengths.

There are lots of other options, for using a particular set of taxa or
taxon labels for the gene trees, etc. etc. Let me know if you want to
do some of this.

Jeet Sukumaran

unread,
Jun 16, 2011, 3:53:54 PM6/16/11
to DendroPy Users
To generate sequences using, for e.g., the HKY model, after you have
your gene tree, `gene_tree`:

from dendropy import seqsim
dna = seqsim.generate_hky_characters(
seq_len=1000,
tree_model=gene_tree,
mutation_rate=1.0,
kappa=1.0,
base_freqs=[0.25, 0.25, 0.25, 0.25])
print(dna.as_string("fasta"))


On Jun 16, 2:27 pm, Jake <essel...@ku.edu> wrote:

Jake

unread,
Jun 16, 2011, 4:24:56 PM6/16/11
to DendroPy Users
Thanks Jeet! Works perfectly. The natural next question is: Is it
possible to simulate DNA sequences on the gene tree?

Cheers,
Jake

Jeet Sukumaran

unread,
Jun 17, 2011, 3:34:56 PM6/17/11
to DendroPy Users
Did you see the second post? Putting it altogether:

#####################################################

#! /usr/bin/env python

import dendropy
from dendropy import treesim
from dendropy import seqsim

# read the species tree
sp_tree = dendropy.Tree.get_from_string("[&R] (((A:2, B:2):1,(C:1,
D:1):2):1,E:4);",
"newick")

# set the number of individuals sampled from each species
for leaf in sp_tree.leaf_iter():
leaf.num_genes = 5 # can be any number you want

# if your branch lengths are NOT in coalescent units
# and are given in, e.g. generations, you will need
# to set the population sizes of the edges
#for edge in sp_tree.postorder_edge_iter():
# edge.pop_size = 1.0 # 1.0 => branch lengths in coalescent
units

# Simulate a gene tree within the species tree.
# `gene_tree` will be the constrained/censored/truncated gene tree
# `mapped_sp_tree` is a *clone* of the original input species
tree,
# but with the gene tree nodes as attributes of its
# nodes, so you can see where on the species tree they coalesce
etc.
# You can ignore it if you are not using this information.
gene_tree, mapped_sp_tree = treesim.constrained_kingman(sp_tree)

# simulate sequences
dna = seqsim.generate_hky_characters(
seq_len=1000,
tree_model=gene_tree,
mutation_rate=1.0,
kappa=1.0,
base_freqs=[0.25, 0.25, 0.25, 0.25])

# show sequences
# can also be saved with: `dna.write_to_path()`
print(dna.as_string("fasta"))

#####################################################

Note that an alternate approach would be to output the tree string, as
in the first example, and then use that as input to seq-gen. This will
be faster than using the Python-based sequence generation. If you have
PySeqGen installed (https://github.com/jeetsukumaran/PySeqGen), you
will have the best of both worlds: working within Python completely,
but the speed of the C-based seq-gen computation.

-- jeet
Message has been deleted

ca

unread,
Jun 25, 2021, 4:07:31 PM6/25/21
to DendroPy Users
Are there other models, apart from the HKY model, to generate DNA sequences from trees in dendroPy? If so, where can I find them?

Thanks!
Reply all
Reply to author
Forward
0 new messages