Nonparametric bootstrapping

Mark Phuong

unread,

Mar 5, 2015, 11:19:51 PM3/5/15

to dadi...@googlegroups.com

Hi Ryan!

Hope you are doing well! I estimated parameters under a three population divergence model and I am trying to estimate parameter uncertainties using a nonparametric bootstrap. After generating ~100 datasets by randomly sampling SNPs (with replacement) from my dataset, I obtained real unit estimates of N, T, and M:

	Tn.s	Tc.s	N_n	N_s	N_c	m12	m13	m23
AVERAGE	206960.2683	104601.5494	64976.18	209939.3	89213.69	398.0323	170.139	9311.397
STDEV	21145.04394	17563.81575	1901.063	9126.291	5645.425	200.0125	88.13612	4734.688
OBSERVED	411967.886	180059.2868	128905	388960.3	177993.4	1114.351	463.508	26982.88

OBSERVED = my observed parameter estimates, AVERAGE = average parameter estimates over the 100 datasets. My question is this: the averages are about half the observed values -- I feel like this is a math error, but I'm not sure why I need to multiply all these values by 2 again, or where my calculations went wrong. Any thoughts?

Here are the equations and parameters I used to get the data:

variants before filtering	32604
total sites	2383683
# SNPs after filtering	5076
effective length (L)	371107.1
mu	2.2E-09
theta	107.412
Nref (theta/(4Lu)	32890.54

Divergence time - 2*tau*Nref

Effective Population size - Nref*N

Effective migrants - m*Nref*2

I've attached a spreadsheet with the calculations, and all the work is under the 'dadi_bootstrap' worksheet.

thanks a million for your advice!

Mark

dadicalculations.xlsx

Gutenkunst, Ryan N - (rgutenk)

unread,

Mar 9, 2015, 2:13:40 PM3/9/15

to dadi...@googlegroups.com

Hello Mark,

Looking at your spreadsheet, I notice that your bootstrap estimates of theta are consistently about a factor of two lower than your full data value. That's likely what's causing the factors of two you're seeing. A quick check is to ensure that all your bootstrap frequency spectra have a similar number of SNPs to your real data. Also, it's helpful to do a dadi.Plotting.plot_2D_comp_Poisson between your real and (a few) bootstrap data sets. You should see that the residuals look like random noise.

Also, note that for a non-parametric bootstrap, you want to be bootstrapping over independent units. So if you're SNPs come from sequencing, they aren't independent from each other due to linkage. What we usually do is divide the genome in large regions (say 1 Mb), then bootstrap over those.

Best,

Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.
<dadicalculations.xlsx>

Mark Phuong

unread,

Mar 9, 2015, 8:59:37 PM3/9/15

to dadi...@googlegroups.com

Hi Ryan!

Thanks for your response! My data are from an exon capture approach in a non-model squirrel, so the independent units I am sampling from are the 'genes'. From the set of SNPs I used in my real dataset, I randomly sampled SNPs from those genes to generate a dataset of equal size.

I took a look at the residuals (for three bootstrap sets compared to real data, attached) and I think they look fine (?). What's your opinion on what they look like?

Many thanks Ryan!

Mark

bootstrap1.pdf

bootstrap2.pdf

bootstrap3.pdf

Gutenkunst, Ryan N - (rgutenk)

unread,

Mar 11, 2015, 3:33:16 PM3/11/15

to dadi...@googlegroups.com

Hi Mark,

Assuming those are Poisson and not multinomial residuals, they look great. Exactly like I'd expect.

So I'm not sure where that factor of 2 in theta is coming from. You should be able to run the exact same script for your bootstrap spectra as you did the real data.

Best,

Ryan

<bootstrap1.pdf><bootstrap2.pdf><bootstrap3.pdf>

Mark Phuong

unread,

Mar 11, 2015, 5:03:02 PM3/11/15

to dadi...@googlegroups.com

Hi Ryan!

Thanks for pointing out that clarification -- I used the wrong plotting command. These residuals (attached) look skewed. To fix this problem, I should fix my bootstrapping sampling scheme right?

Many thanks!

Mark

bootstrap3.pdf

bootstrap1.pdf

bootstrap2.pdf

Gutenkunst, Ryan N - (rgutenk)

unread,

Mar 11, 2015, 11:29:16 PM3/11/15

to dadi...@googlegroups.com

Hi Mark,

Yes, it looks like your bootstrapping scheme is generating too few SNPs.

One potential issue you might be encountering. If you're using make_data_dict and from_data_dict functions in dadi, note that they rely on each SNP in the data having a unique ID, defined by the last columns in the data file. So if your bootstrap code is generating such a file, it needs to ensure the last columns or all unique. Otherwise the repeated entries will get dropped.

Best,

Ryan

<bootstrap3.pdf><bootstrap1.pdf><bootstrap2.pdf>

--

Ryan Gutenkunst

Assistant Professor
Molecular and Cellular Biology
University of Arizona
phone: (520) 626-0569
http://gutengroup.mcb.arizona.edu

c.monster

unread,

Apr 22, 2015, 11:49:59 PM4/22/15

to dadi...@googlegroups.com

Hello -

Here is something that might be useful, especially for people studying non-model organisms (especially using RAD):

The first script (vcf2dadi.pl) generates Misc.make_data_dict -readable file from a VCF v.4.1 file, such as emitted by 2bRAD denovo or GATK pipelines (I cannibalized somebody else's script but sadly cannot find the original anymore, so I apologize for not listing the original author)

The second script (dadiBoot.pl) uses the output of vcf2dadi.pl to bootstrap over entries listed in the "Gene" column. Those would be genomic contigs or scaffolds. The script makes sure that the same locus and coordinate does not repeat, so it seems to work just fine in terms of theta.

I am personally very happy with the nonparametric bootstrap results for my own 2bRAD data. Since my dadi models started accounting for a possibility of ancestral state misidentification (as per Ryan's suggestion in another thread "Excess of low-frequency variants") they fit really well and the results are quite consistent across bootstrap replicates (including theta).

cheers

Mikhail

vcf2dadi.pl

dadiBoot.pl

Gutenkunst, Ryan N - (rgutenk)

unread,

Apr 23, 2015, 12:14:50 PM4/23/15

to dadi...@googlegroups.com

Hi Mikhail,

Thanks for posting these! I think others will find them very useful.

Best,

Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.

<vcf2dadi.pl><dadiBoot.pl>

--

Ryan Gutenkunst

Assistant Professor
Molecular and Cellular Biology
University of Arizona

phone: (520) 626-0569, office LSS 325

http://gutengroup.mcb.arizona.edu

Mark Phuong

unread,

Jun 3, 2015, 3:43:40 AM6/3/15

to dadi...@googlegroups.com

Hi Ryan!

Thanks for the tip. That was definitely what was causing my initial bootstrapping problems.

I have a follow up question about interpretation. I'm primarily interested in migration rate estimates from a 3 pop divergence model, and it seems that from the observed paramter estimates, M23 is about 10X higher than M13 or M12. I did the bootstrapping, and obtained the following confidence intervals:

M12	360 [0 - 588]
M13	367 [0 - 595]
M23	7561 [0 - 10859]

following the calculation of the MEAN +- 1.96*STDEV. I am told this means that there is a signal that M23 is higher, but the wide confidence intervals suggest that I can't say that with certainty. Is that all I can say about these values? For every bootstrap set, the migration rate values seem to change together, in a sense that the ratio between the migration rates always show M23 being 10-20X higher than the other two rates. Does that mean anything? I've attached the parameter estimates for the bootstrap datasets. All my other parameters seem to follow a normal distribution and have pretty tight confidence intervals (N3 pictured as an example), whereas my migration rate estimates seem to jump all over the place and are somewhat left-centered.

Many thanks for your advice!

Mark

dadi_bootstrap.pdf

Gutenkunst, Ryan N - (rgutenk)

unread,

Jun 3, 2015, 3:54:43 PM6/3/15

to dadi...@googlegroups.com

Hi Mark,

I often find it helpful to plot the joint distributions of the bootstrap parameters. For example, a scatter plot of M23 vs M12, where each point is a bootstrap result. That will reveal correlations between your inferred parameters. You can make a statement like "In 99/100 bootstraps M23 > M12, indicating that migration between populations 2 and 3 is higher than that between 1 and 2"

Best,

Ryan

<dadi_bootstrap.pdf>

--

Ryan Gutenkunst

Assistant Professor
Molecular and Cellular Biology
University of Arizona

phone: (520) 626-0569, office LSS 325

http://gutengroup.mcb.arizona.edu

Mark Phuong

unread,

Jun 3, 2015, 7:22:04 PM6/3/15

to dadi...@googlegroups.com

Sweet. Thanks for the tip Ryan!

Reply all

Reply to author

Forward