projection & model selection


lei zhang

Aug 6, 2016, 8:52:46 AM
to dadi-user

Dear Ryan and dadi-users:

         I’m new to dadi. I am working with a diploid crop species (lettuce). The populations are different horticultural types, for example crisphead, butterhead, or romaine. I'm trying to fit 1D models to each population before moving on to 2D models. The SNPs were called from RNA-seq data, and only intergenic, intronic, and synonymous SNPs were used in the dadi analysis. Because expression differs among genes, many SNPs have missing genotypes, so I projected down my sample size, using fs.S() to maximize the number of segregating SNPs.

For example, the butterhead population includes 27 individuals. I projected down to several sample sizes (the first attachment lists the number of segregating SNPs at each projected sample size) and plotted the resulting frequency spectra (FS). With a sample size of 54 (no projection), the spectrum shows a zigzag pattern, but with a sample size of 20 (which maximizes S in this case), it looks normal.

To compare results across sample sizes, I fit the five 1D models implemented in dadi at sample sizes of 20, 30, 40, and 54 (see the second attachment for the inferred parameters). At a sample size of 20, four models (growth, bottlegrowth, two_epoch, and three_epoch) have almost the same likelihood but very different parameter values. As the sample size increases, the likelihood differences among these four models grow, but they are not significant. According to the literature, butterhead-type plants appeared about five hundred years ago, so the population must have gone through a bottleneck. But the results do not support the bottleneck model (the parameter values inferred under the bottlegrowth model hit their bounds). I cannot determine which model is best. I would be grateful if you could answer a few questions:

(1)   Should I project down the sample size?

(2)   If yes, which sample size should I use? As the first attachment shows, the number of segregating SNPs does not differ significantly for sample sizes from 10 to 40. Will projecting down to a much smaller sample size (e.g., 20, when the population has 27 individuals) affect dadi's performance?

(3)   How should I choose the best model? Is there something wrong with my data?

       Any help would be greatly appreciated. 

      Thanks!

Lei zhang

butterhead.max_S.log
project_20-30-40-54.parameters.txt
butterhead_20_combined.pdf
butterhead_30_combined.pdf
butterhead_40_combined.pdf
butterhead_54_combined.pdf

lei zhang

Aug 6, 2016, 9:30:06 AM
to dadi-user
Hi Ryan,
    Sorry to bother you again, but I have another question.
    Should I remove linked SNPs (SNPs located in the same LD block)?

Gutenkunst, Ryan N - (rgutenk)

Aug 8, 2016, 12:54:43 PM
to dadi...@googlegroups.com
Hello Lei,

My first concern is the “zig-zag” pattern you see in the non-projected data. Could that be because of inbreeding in your samples, leading to excess homozygosity? dadi isn’t set up to model that, and it might cause big problems in the fitting. Projecting down may obscure this effect (because projecting “smears” the spectrum), but it’s still there. If possible, you should try to eliminate it from the data. You might check whether some samples are more inbred than others and can be dropped, or whether it’s possible to take only one allele from each sample. It might also be an issue with the SNP calling somehow creating excess homozygosity.
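To illustrate why fully inbred samples can produce a zig-zag spectrum, here is a minimal sketch in plain Python (not dadi code). The sample size of 27 matches the butterhead accessions in this thread, but the SNP counts are simulated and purely hypothetical. If every accession is homozygous at every site, counting both chromosomes yields only even derived-allele counts, so the odd entries of the spectrum are empty; keeping one allele per accession removes the artifact. Real data are rarely perfectly homozygous, so the odd entries would be depleted rather than exactly zero.

```python
import random

def sfs_from_counts(counts, n_chrom):
    """Build an unfolded spectrum: sfs[i] = number of SNPs with derived count i."""
    sfs = [0] * (n_chrom + 1)
    for c in counts:
        sfs[c] += 1
    return sfs

random.seed(1)
n_lines = 27    # inbred accessions, as in the butterhead sample
n_snps = 5000   # hypothetical number of SNPs

# Hypothetical data: number of accessions carrying the derived allele at each SNP.
# Weights proportional to 1/k are only a rough stand-in for a neutral spectrum.
ks = list(range(1, n_lines))
carriers = random.choices(ks, weights=[1.0 / k for k in ks], k=n_snps)

# Counting both chromosomes of fully homozygous lines doubles every count.
diploid_sfs = sfs_from_counts([2 * k for k in carriers], 2 * n_lines)

# Taking one allele per accession uses the carrier counts directly.
haploid_sfs = sfs_from_counts(carriers, n_lines)

print("SNPs in odd-count classes (both alleles counted):", sum(diploid_sfs[1::2]))
print("First entries, both alleles counted:", diploid_sfs[1:9])
print("First entries, one allele each:     ", haploid_sfs[1:5])
```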

Once you get that sorted out, I wouldn’t project down very far, because you’re interested in recent events. Recent events are mostly reflected in rare alleles, so projection will particularly hurt your power there.
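The cost of projection for rare alleles can be made concrete. Projecting from n to m chromosomes amounts to subsampling without replacement, so a site with derived count i among n chromosomes stays polymorphic in the subsample with probability 1 - [C(n-i, m) + C(i, m)] / C(n, m). The sketch below is plain Python, not dadi's API; the sizes 54 and 20 are taken from this thread, and the function name is made up for illustration.

```python
from math import comb

def prob_still_segregating(n, m, i):
    """Probability that a SNP with derived count i among n chromosomes
    is still polymorphic after subsampling m chromosomes without replacement."""
    all_ancestral = comb(n - i, m)  # subsample misses every derived copy
    all_derived = comb(i, m)        # subsample contains only derived copies
    return 1.0 - (all_ancestral + all_derived) / comb(n, m)

n, m = 54, 20
for i in (1, 2, 5, 10, 27):
    print(f"derived count {i:2d}: retained with probability "
          f"{prob_still_segregating(n, m, i):.3f}")
```

For a singleton the expression reduces to m/n, so projecting from 54 to 20 keeps a singleton polymorphic only 20/54 of the time (about 37%), while intermediate-frequency sites are almost always retained. The loss therefore falls mainly on the rare-allele classes that carry most of the signal about recent events.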

You may find that you don’t have power to detect a recent bottleneck with the data you have. Is 500 years equal to 500 generations in lettuce?
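For a sense of scale: dadi measures time in units of 2*N_ref generations, where N_ref is the reference (ancestral) population size. The following sketch converts 500 generations into those units under a few hypothetical values of N_ref; neither the N_ref values nor the one-generation-per-year assumption comes from the data in this thread.

```python
def generations_to_dadi_time(generations, n_ref):
    """Convert generations to dadi's time unit of 2 * N_ref generations."""
    return generations / (2.0 * n_ref)

generations = 500  # 500 years, assuming one generation per year
for n_ref in (1_000, 10_000, 100_000):  # hypothetical reference population sizes
    T = generations_to_dadi_time(generations, n_ref)
    print(f"N_ref = {n_ref:>7}: T = {T:.4f}")
```

The larger N_ref is, the smaller the scaled time becomes, and very small values of T leave little time for new mutations to reshape the spectrum, which is one reason a recent event can be hard to detect.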

Best,
Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at https://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

--
Ryan Gutenkunst
Assistant Professor of Molecular and Cellular Biology, University of Arizona
phone: (520) 626-0569, office: LSS 325, web: http://gutengroup.mcb.arizona.edu

Gutenkunst, Ryan N - (rgutenk)

Aug 8, 2016, 12:55:08 PM
to dadi...@googlegroups.com
You shouldn’t need to remove linked SNPs. The effects of linkage can be taken care of in the uncertainty analysis.
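One common way to account for linkage in an uncertainty analysis is to bootstrap over genomic regions rather than over individual SNPs, so that linked SNPs are always resampled together. The sketch below shows the idea in plain Python. It is not dadi's own routine, and the chunk layout, function names, and simulated counts are hypothetical. Spectra resampled this way are the kind of input that bootstrap-based uncertainty estimates work from.

```python
import random

def sfs_from_counts(counts, n_chrom):
    """Build an unfolded spectrum: sfs[i] = number of SNPs with derived count i."""
    sfs = [0] * (n_chrom + 1)
    for c in counts:
        sfs[c] += 1
    return sfs

def block_bootstrap_sfs(chunks, n_chrom, n_boot, rng):
    """Resample whole genomic chunks with replacement and rebuild the spectrum.

    chunks: list of lists; each inner list holds the derived-allele counts
            of the SNPs that fall in one genomic region.
    """
    spectra = []
    for _ in range(n_boot):
        resampled = [rng.choice(chunks) for _ in chunks]
        counts = [c for chunk in resampled for c in chunk]
        spectra.append(sfs_from_counts(counts, n_chrom))
    return spectra

rng = random.Random(0)
n_chrom = 20
# Hypothetical data: 50 regions with 5 to 40 SNPs each.
chunks = [[rng.randint(1, n_chrom - 1) for _ in range(rng.randint(5, 40))]
          for _ in range(50)]

boot = block_bootstrap_sfs(chunks, n_chrom, n_boot=100, rng=rng)
singletons = [fs[1] for fs in boot]
print("singleton count across bootstraps: min", min(singletons),
      "max", max(singletons))
```

The spread of any statistic (or any refitted parameter) across the bootstrap spectra then reflects the correlation among linked SNPs, which a per-SNP resampling would understate.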


lei zhang

Aug 9, 2016, 1:10:10 PM
to dadi-user

Hi Ryan,

 

     Thanks a lot for your answer. The generation time of lettuce is one year. The lettuce accessions we sequenced are inbred lines, so most loci are homozygous. As you suggested, I took only one allele from each accession and plotted the FS for every horticultural type using non-projected data. None of them shows a zigzag pattern, but all except the serriola population (wild relatives of cultivated lettuce) show fluctuations in the middle of the x axis. If I project down to a smaller sample size, the fluctuations disappear. But as you said, the problem is still there.

    I would like to know whether these fluctuations are normal. Some of the accessions within a population are related to some degree. I calculated the pairwise genetic similarity among the accessions from the SNP data, and some are almost identical at the genotype level; for example, accessions A and B differ at only 1292 SNPs (the RNA-seq data cover 30 Mb). Is this why the FS shows fluctuations? Should I remove samples that are highly similar to others?

 

Many thanks in advance

Lei zhang


On Tuesday, August 9, 2016 at 12:54:43 AM UTC+8, Ryan Gutenkunst wrote:
all_fs_with_non-projected data.pdf
all_fs_with_projected data.pdf

Gutenkunst, Ryan N - (rgutenk)

Aug 11, 2016, 1:49:03 PM
to dadi...@googlegroups.com
Hello Lei,

On Aug 9, 2016, at 10:10 AM, lei zhang <zhangl...@gmail.com> wrote:

     Thanks a lot for your answer. The generation time of lettuce is one year. The lettuce accessions we sequenced are inbred lines, so most loci are homozygous. As you suggested, I took only one allele from each accession and plotted the FS for every horticultural type using non-projected data. None of them shows a zigzag pattern, but all except the serriola population (wild relatives of cultivated lettuce) show fluctuations in the middle of the x axis. If I project down to a smaller sample size, the fluctuations disappear. But as you said, the problem is still there.

The projection is really about maximizing the number of SNPs you can analyze to deal with missing data. Don’t try to use it to mask problems in the data.
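The trade-off behind choosing a projection size with missing data can be sketched as follows: a SNP genotyped in fewer than m chromosomes is dropped, while a SNP genotyped in more is subsampled down to m and may become monomorphic. The example below computes the expected number of segregating sites for each m in plain Python. It only mirrors in spirit what scanning fs.S() over projection sizes does; the function name and the simulated missing-data pattern are hypothetical.

```python
from math import comb
import random

def expected_segregating(snps, m):
    """Expected number of segregating sites after projecting every SNP to m.

    snps: list of (n_called, derived_count) pairs, one per SNP.
    SNPs with fewer than m called chromosomes are dropped.
    """
    total = 0.0
    for n, i in snps:
        if n < m:
            continue
        monomorphic = (comb(n - i, m) + comb(i, m)) / comb(n, m)
        total += 1.0 - monomorphic
    return total

random.seed(2)
n_max = 54
snps = []
for _ in range(2000):
    n_called = random.randint(10, n_max)       # hypothetical missing-data pattern
    derived = random.randint(1, n_called - 1)  # hypothetical derived-allele count
    snps.append((n_called, derived))

s_by_m = {m: expected_segregating(snps, m) for m in range(2, n_max + 1, 2)}
best_m = max(s_by_m, key=s_by_m.get)
print("projection size with the most segregating sites:", best_m)
print("expected S there:", round(s_by_m[best_m], 1),
      "| expected S without projection:", round(s_by_m[n_max], 1))
```

The maximum only answers the missing-data question. It says nothing about whether the spectrum at that size is free of artifacts such as inbreeding or relatedness.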

    I would like to know whether these fluctuations are normal. Some of the accessions within a population are related to some degree. I calculated the pairwise genetic similarity among the accessions from the SNP data, and some are almost identical at the genotype level; for example, accessions A and B differ at only 1292 SNPs (the RNA-seq data cover 30 Mb). Is this why the FS shows fluctuations? Should I remove samples that are highly similar to others?

dadi assumes that the data represent random samples from the population. Relatedness within your samples will bias the results, so you should remove the related individuals (leaving only one individual from each set of relatives).
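A minimal sketch of one way to keep a single individual from each set of relatives: group accessions that are linked by any chain of near-identical pairs, then keep the first accession in each group. This is plain Python, not part of dadi. The 1292-SNP difference between accessions A and B comes from this thread; the other distances, accessions C and D, and the threshold of 5000 are hypothetical and would need to be chosen from the actual distribution of pairwise differences.

```python
def drop_near_duplicates(samples, pairwise_diffs, min_diffs):
    """Keep one representative from each group of near-identical samples.

    pairwise_diffs: dict mapping frozenset({a, b}) to the number of SNP differences.
    Samples connected by any chain of pairs with fewer than min_diffs
    differences end up in the same group.
    """
    parent = {s: s for s in samples}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for pair, diffs in pairwise_diffs.items():
        if diffs < min_diffs:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    kept, seen = [], set()
    for s in samples:  # the first sample listed in each group is kept
        root = find(s)
        if root not in seen:
            seen.add(root)
            kept.append(s)
    return kept

samples = ["A", "B", "C", "D"]
diffs = {
    frozenset({"A", "B"}): 1292,   # the near-identical pair mentioned in the thread
    frozenset({"A", "C"}): 48000,  # all remaining values are hypothetical
    frozenset({"A", "D"}): 51000,
    frozenset({"B", "C"}): 47500,
    frozenset({"B", "D"}): 50500,
    frozenset({"C", "D"}): 3100,
}
kept = drop_near_duplicates(samples, diffs, min_diffs=5000)
print(kept)
```

Which member of each group to keep is a separate choice. Keeping the accession with the least missing data would be a reasonable alternative to keeping the first one listed.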

Best,
Ryan