Groups
Groups

folded in dadi

817 views
Skip to first unread message

7745...@qq.com

unread,
Sep 3, 2014, 1:09:28 PM9/3/14
to dadi...@googlegroups.com
Hi, Ryan,

I am trying to use dadi, for making some inference of population genetics for my three population data set. Firstly, I'm really a newbie to this tools and feel sorry if the questions I asked sound very easy for you.

We use the "convert_vcf_to_dadi_input.pl script that was attached in dadi group by one of the dadi users  to get the input file. However, the second column ("OUT") in the output file by this script tend to be same as the reference sequence ( first column ). The original print command in the script is "my $line="$ref\t$ref\t$vcf{$id}{$pos}{ref}";" . We think it incorrect if the outgroup species is not the reference genome. Nevertheless, we use this script to covert our file because we don't get the outgroup sequence in our project. Here is that script that we used to infer FS :

import dadi
dd = dadi.Misc.make_data_dict('pop.list.data')
folded = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False)
import pylab
dadi.Plotting.plot_single_2d_sfs(folded, vmin=0.1)
pylab.show()

Because no outgroup information is presented, So I use the folded version, and "polarized" is set "False", however, the AFS result seems strange (figure_1.png), seems half of  the genome don't present SNPs. When folded is not used, we get the same rusult like figure_1.png, the script is:

import dadi
dd = dadi.Misc.make_data_dict('pop.list.data')
fs = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False)
import pylab
dadi.Plotting.plot_single_2d_sfs(fs, vmin=0.1)
pylab.show()

Compared both results of folded and unfolded, the "folded" seems work uneffectively. Is that the script I used right? By the way, I think the sample size you defined in the manual is two times of the individual numbers if the sample is diploid, am I right? But when I set projections=[10,10] because both of the populations have 5 individuals, I get the error "value error: minvalue must be less than or equal to maxvalue". I can't figure out the reason because I'm really not good at programming.
figure_1.png

Gutenkunst, Ryan N - (rgutenk)

unread,
Sep 4, 2014, 6:21:48 PM9/4/14
to dadi...@googlegroups.com
Hello unnamed dadi user,



On Sep 3, 2014, at 10:09 AM, 7745...@qq.com wrote:
We use the "convert_vcf_to_dadi_input.pl script that was attached in dadi group by one of the dadi users  to get the input file. However, the second column ("OUT") in the output file by this script tend to be same as the reference sequence ( first column ). The original print command in the script is "my $line="$ref\t$ref\t$vcf{$id}{$pos}{ref}";" . We think it incorrect if the outgroup species is not the reference genome.

Correct, the outgroup is not the reference genome. We didn’t write that script ourselves, so it’s unofficial.

Nevertheless, we use this script to covert our file because we don't get the outgroup sequence in our project. Here is that script that we used to infer FS :

import dadi
dd = dadi.Misc.make_data_dict('pop.list.data')
folded = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False)
import pylab
dadi.Plotting.plot_single_2d_sfs(folded, vmin=0.1)
pylab.show()

Because no outgroup information is presented, So I use the folded version, and "polarized" is set "False", however, the AFS result seems strange (figure_1.png), seems half of  the genome don't present SNPs.

The is correct for a folded FS. If the FS is folded, then the counts are the counts of the minor allele. It is impossible, with your number of samples, to have a SNP, for example, with a minor allele frequency of (7,7).

When folded is not used, we get the same rusult like figure_1.png, the script is:

import dadi
dd = dadi.Misc.make_data_dict('pop.list.data')
fs = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False)
import pylab
dadi.Plotting.plot_single_2d_sfs(fs, vmin=0.1)
pylab.show()

Compared both results of folded and unfolded, the "folded" seems work uneffectively. Is that the script I used right?

Your dadi commands appear correct. It is likely an issue with the input file. If in your data the reference allele (and thus the outgroup allele due to the conversion script) is always set to be the major allele, then you would get this result.

By the way, I think the sample size you defined in the manual is two times of the individual numbers if the sample is diploid, am I right? But when I set projections=[10,10] because both of the populations have 5 individuals, I get the error "value error: minvalue must be less than or equal to maxvalue". I can't figure out the reason because I'm really not good at programming.

I can’t debug this unless you send me the entire error, which would help reveal what the problem is. I’m happy to help if you send me the full error trace.

Best,
Ryan


--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.
<figure_1.png>

--
Ryan Gutenkunst
Assistant Professor
Molecular and Cellular Biology
University of Arizona
phone: (520) 626-0569
http://gutengroup.mcb.arizona.edu

7745...@qq.com

unread,
Sep 11, 2014, 12:36:03 PM9/11/14
to dadi...@googlegroups.com
Hi, Ryan,

Sorry for my delay due to something. You are right, the script I used above is right, since I have seen the similar result from some published papers. These weeks I have carefully read the manual and examples again, and try to apply some probable models to my data set. Again, I get some questions that confused.

1: How to get the best topology among three populations by dadi. For example I got three population and there are four possible topology, (pop1(pop2,pop3)), (pop2(pop1,pop3)),     (pop3(pop1,pop2)),and the last one is ancestral population split into three populations simultaneously. I can't get the answer from the manual and examples to solve this situation maybe I misread something.

2: Another question is I want to setup a new model to my data, which the initial population go through a split, then the children populations begin grow exponential respectively and then,  they both suffered a bottleneck, then, one of the children population ( ancestral pop2) stated to split pop2 and pop3 due to some refugee barrier. So I try to imitate the examples to write the scripts. But I don't know how to add the bottleneck event after a simultaneously exponential event. Here is the script that I try to write down:

import numpy
from dadi import Numerics, PhiManip, Integration, Spectrum

def newmodel (...) (bypass)
   
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    phi = PhiManip.phi_1D (xx, phi, nu = nuA) # nuA: ancestral population size, 

# ancestral population initial split.T1: the first split, T2: the time of the total population size peak that go throng a exponential growth.  nuQ2 and nuQ1 mean one of children population #size experience a exponential growth from nuQ1 to nuQ2. nuN2 and nuN1 represent another children population experience the same event. 
  
    phi = PhiManip.phi_1D_to_2D(xx, phi)
    
    nuQ2_func = lambda t: nuQ2 * (nuQ1/nuQ2)**(t/(T1-T2)
    nuN2_func = lambda t: nuN2 * (nuN1/nuN2)**(t/(T1-T2)

# then both of the children populations experience a bottleneck ( the following is right?  Imitate from one of topic in this group, seemly same to exponential event )
    nuQ3_func =  nuQ3*(nuQ2/nuQ3)**(t/T2-T3)  #T3: mean the time of end of bottleneck. nuQ3 and nuN3 mean the two children population size after experience a bottleneck event.
    nuN3_func =  nuN3*(nuN2/nuN3)**(t/T2-T3)

    phi = Integration.two_pops(phi, xx, T1, T2, T3, nuQ2_func, nuN2_func,  nuQ3_fun, nuN3_func,
                            then how to write the migration over time T1-T2 and T2-T3 between the two children populations)

# then one of children population split (bypass)

Sorry about the bad python programming, I'm really not good at python. It's very appreciate that you give me some suggestions.

Baolin
在 2014年9月4日星期四UTC+8上午1时09分28秒,7745...@qq.com写道:

Gutenkunst, Ryan N - (rgutenk)

unread,
Sep 19, 2014, 12:06:35 AM9/19/14
to dadi...@googlegroups.com
Hello Baolin,

Sorry for the slow reply. Very busy week. Responses below in your text.

Best,
Ryan

On Sep 11, 2014, at 9:36 AM, 7745...@qq.com wrote:
1: How to get the best topology among three populations by dadi. For example I got three population and there are four possible topology, (pop1(pop2,pop3)), (pop2(pop1,pop3)),     (pop3(pop1,pop2)),and the last one is ancestral population split into three populations simultaneously. I can't get the answer from the manual and examples to solve this situation maybe I misread something.

You’ll need to fit the four models to your data and compare likelihoods. It may be that they are very different and one model obviously fits better than the others. If so, it’s an easy choice.

If the best model isn’t obvious, you’ll want to do a formal model selection test. In this case, you could use something like AIC or BIC. But be careful because dadi’s likelihood isn’t the full likelihood; it’s a composite likelihood, which will tend to overestimate evidence for the more complex model.

2: Another question is I want to setup a new model to my data, which the initial population go through a split, then the children populations begin grow exponential respectively and then,  they both suffered a bottleneck, then, one of the children population ( ancestral pop2) stated to split pop2 and pop3 due to some refugee barrier. So I try to imitate the examples to write the scripts. But I don't know how to add the bottleneck event after a simultaneously exponential event. Here is the script that I try to write down:

import numpy
from dadi import Numerics, PhiManip, Integration, Spectrum

def newmodel (...) (bypass)
   
    xx = Numerics.default_grid(pts)
    phi = PhiManip.phi_1D(xx)
    phi = PhiManip.phi_1D (xx, phi, nu = nuA) # nuA: ancestral population size, 

# ancestral population initial split.T1: the first split, T2: the time of the total population size peak that go throng a exponential growth.  nuQ2 and nuQ1 mean one of children population #size experience a exponential growth from nuQ1 to nuQ2. nuN2 and nuN1 represent another children population experience the same event. 
  
    phi = PhiManip.phi_1D_to_2D(xx, phi)
    
    nuQ2_func = lambda t: nuQ2 * (nuQ1/nuQ2)**(t/(T1-T2)
    nuN2_func = lambda t: nuN2 * (nuN1/nuN2)**(t/(T1-T2)

You probably want to integrate for a time T1 - T2 here, to actually have the populations experience this history.

# then both of the children populations experience a bottleneck ( the following is right?  Imitate from one of topic in this group, seemly same to exponential event )
    nuQ3_func =  nuQ3*(nuQ2/nuQ3)**(t/T2-T3)  #T3: mean the time of end of bottleneck. nuQ3 and nuN3 mean the two children population size after experience a bottleneck event.
    nuN3_func =  nuN3*(nuN2/nuN3)**(t/T2-T3)

Now you’ll want to integrate for a time T2 - T3.

    phi = Integration.two_pops(phi, xx, T1, T2, T3, nuQ2_func, nuN2_func,  nuQ3_fun, nuN3_func,
                            then how to write the migration over time T1-T2 and T2-T3 between the two children populations)

# then one of children population split (bypass)

Sorry about the bad python programming, I'm really not good at python. It's very appreciate that you give me some suggestions.

Baolin
在 2014年9月4日星期四UTC+8上午1时09分28秒,7745...@qq.com写道:
Hi, Ryan,

I am trying to use dadi, for making some inference of population genetics for my three population data set. Firstly, I'm really a newbie to this tools and feel sorry if the questions I asked sound very easy for you.

We use the "convert_vcf_to_dadi_input.pl script that was attached in dadi group by one of the dadi users  to get the input file. However, the second column ("OUT") in the output file by this script tend to be same as the reference sequence ( first column ). The original print command in the script is "my $line="$ref\t$ref\t$vcf{$id}{$pos}{ref}";" . We think it incorrect if the outgroup species is not the reference genome. Nevertheless, we use this script to covert our file because we don't get the outgroup sequence in our project. Here is that script that we used to infer FS :

import dadi
dd = dadi.Misc.make_data_dict('pop.list.data')
folded = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False)
import pylab
dadi.Plotting.plot_single_2d_sfs(folded, vmin=0.1)
pylab.show()

Because no outgroup information is presented, So I use the folded version, and "polarized" is set "False", however, the AFS result seems strange (figure_1.png), seems half of  the genome don't present SNPs. When folded is not used, we get the same rusult like figure_1.png, the script is:

import dadi
dd = dadi.Misc.make_data_dict('pop.list.data')
fs = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False)
import pylab
dadi.Plotting.plot_single_2d_sfs(fs, vmin=0.1)
pylab.show()

Compared both results of folded and unfolded, the "folded" seems work uneffectively. Is that the script I used right? By the way, I think the sample size you defined in the manual is two times of the individual numbers if the sample is diploid, am I right? But when I set projections=[10,10] because both of the populations have 5 individuals, I get the error "value error: minvalue must be less than or equal to maxvalue". I can't figure out the reason because I'm really not good at programming.

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

Elmira Mohandesan

unread,
Sep 29, 2014, 8:25:21 AM9/29/14
to dadi...@googlegroups.com
Dear dadi user,

I also have the same problem as you had. I have 3 pops (better say species), and one reference, and no outgroup.

I could not see what is the difference between the two script below. They both seem polarized.
would you please let me know how did you solve this problem eventually. specially I am interested in Ryan comments "If in your data the reference allele (and thus the outgroup allele due to the conversion script) is always set to be the major allele, then you would get this result", but I could not see your unfolded figure.
Cheers

Here is that script that we used to infer FS :

import dadi
dd = dadi.Misc.make_data_dict('pop.list.data')
folded = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False)
import pylab
dadi.Plotting.plot_single_2d_sfs(folded, vmin=0.1)
pylab.show()

Because no outgroup information is presented, So I use the folded version, and "polarized" is set "False", however, the AFS result seems strange (figure_1.png), seems half of  the genome don't present SNPs.

When folded is not used, we get the same rusult like figure_1.png, the script is:

import dadi
dd = dadi.Misc.make_data_dict('pop.list.data')
fs = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False)
import pylab
dadi.Plotting.plot_single_2d_sfs(fs, vmin=0.1)
pylab.show()

Gutenkunst, Ryan N - (rgutenk)

unread,
Sep 29, 2014, 9:28:57 AM9/29/14
to dadi...@googlegroups.com
Hello Elmira,

Both scripts below generate folded spectra. (That’s what polarized=False does.) From your message, it is not clear to me what the “same” problem is that you are having. Does the FS look unusual to you, or are you getting an error when you plot?

Best,
Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

7745...@qq.com

unread,
Sep 29, 2014, 10:33:21 AM9/29/14
to dadi...@googlegroups.com
Hello Elmira,

I think what I make you confused are the two equations folded = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False) and fs = dadi.Spectrum.from_data_dict(dd, ['linzhi','naqu'], [8,8], polarized=False). Firstly, I would like to apologize for the low grade mistake that I have made in the first time I got to know python. After that I know I have misunderstanding the "floded" in the script because I initially think there is something special significance of this word  and can not changed when it get the value from model dadi, rather represent a variable. Actually, it's the same (fs ==folded).

As far as I konw, if your data was folded, that is mean that you don't konw the ancestral state. For example, you have got the vcf file after calling snp, you will don't konw wether the $ref bases is get from their ancestor or the $alt. At this situation, dadi will do the inference based on the minor allele requency. 

Maybe what I said is still not clear for you, I recommend a paper "Comparative and demographic analysis of orangutan genomes" for you in which the authors both consider the data is folded and not folded. 

Best wishes

Baolin

在 2014年9月29日星期一UTC+8下午8时25分21秒,Elmira Mohandesan写道:

Elmira Mohandesan

unread,
Sep 30, 2014, 4:49:42 AM9/30/14
to dadi...@googlegroups.com
Hi Ryan,

Yes my confusion was referring to two scripts that were both folded (all good now!).

I also get the FS plot that is half empty but not having outgroup and looking for minor allele frequency explains it.

Using the script below, when I change the numbers to 14 and 18 as it was described in the paper (Ryan et al. Plos Genetics 2009 - for diploid individuals) I get an error "value error: minvalue must be less than or equal to maxvalue" and I thought maybe in my vcf file I don't have two nucleotide per SNPs for each sample, eg one sample is C in one position and its missing the nucleotide in another allele (.).  


folded = dadi.Spectrum.from_data_dict(dd, ['Bactrian','Ferus'], [7,9], polarized=False)
folded.to_file('Bac_ferus_spectrum_folded.spec')
fs = dadi.Spectrum.from_file('Bac_ferus_spectrum_folded.spec')
fig = pylab.figure(1)
fig.clear()
dadi.Plotting.plot_single_2d_sfs(fs, vmin=1)
fig.savefig('2d_single_Bac_ferus.pdf')

Cheers,

Elmira
Reply all
Reply to author
Forward
0 new messages
Search
Clear search
Close search
Google apps
Main menu