Nonparametric bootstrapping

685 views
Skip to first unread message

Mark Phuong

unread,
Mar 5, 2015, 11:19:51 PM3/5/15
to dadi...@googlegroups.com
Hi Ryan!

Hope you are doing well! I estimated parameters under a three population divergence model and I am trying to estimate parameter uncertainties using a nonparametric bootstrap. After generating ~100 datasets by randomly sampling SNPs (with replacement) from my dataset, I obtained real unit estimates of N, T, and M: 

Tn.s Tc.s N_n N_s N_c m12 m13 m23
AVERAGE  206960.2683 104601.5494 64976.18 209939.3 89213.69 398.0323 170.139 9311.397
STDEV 21145.04394 17563.81575 1901.063 9126.291 5645.425 200.0125 88.13612 4734.688
OBSERVED  411967.886 180059.2868 128905 388960.3 177993.4 1114.351 463.508 26982.88

OBSERVED = my observed parameter estimates, AVERAGE = average parameter estimates over the 100 datasets. My question is this: the averages are about half the observed values -- I feel like this is a math error, but I'm not sure why I need to multiply all these values by 2 again, or where my calculations went wrong. Any thoughts?

Here are the equations and parameters I used to get the data:

variants before filtering 32604
total sites 2383683
# SNPs after filtering 5076
effective length (L) 371107.1
mu 2.2E-09
theta 107.412
Nref (theta/(4*L*u) 32890.54
Divergence time - 2*tau*Nref
Effective Population size - Nref*N
Effective migrants - m*Nref*2

I've attached a spreadsheet with the calculations, and all the work is under the 'dadi_bootstrap' worksheet.

thanks a million for your advice!

Mark


dadicalculations.xlsx

Gutenkunst, Ryan N - (rgutenk)

unread,
Mar 9, 2015, 2:13:40 PM3/9/15
to dadi...@googlegroups.com
Hello Mark,

Looking at your spreadsheet, I notice that your bootstrap estimates of theta are consistently about a factor of two lower than your full data value. That's likely what's causing the factors of two you're seeing. A quick check is to ensure that all your bootstrap frequency spectra have a similar number of SNPs to your real data. Also, it's helpful to do a dadi.Plotting.plot_2D_comp_Poisson between your real and (a few) bootstrap data sets. You should see that the residuals look like random noise.

Also, note that for a non-parametric bootstrap, you want to be bootstrapping over independent units. So if you're SNPs come from sequencing, they aren't independent from each other due to linkage. What we usually do is divide the genome in large regions (say 1 Mb), then bootstrap over those.

Best,
Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.
<dadicalculations.xlsx>

Mark Phuong

unread,
Mar 9, 2015, 8:59:37 PM3/9/15
to dadi...@googlegroups.com
Hi Ryan!

Thanks for your response! My data are from an exon capture approach in a non-model squirrel, so the independent units I am sampling from are the 'genes'. From the set of SNPs I used in my real dataset, I randomly sampled SNPs from those genes to generate a dataset of equal size. 

I took a look at the residuals (for three bootstrap sets compared to real data, attached) and I think they look fine (?). What's your opinion on what they look like?

Many thanks Ryan!

Mark
bootstrap1.pdf
bootstrap2.pdf
bootstrap3.pdf

Gutenkunst, Ryan N - (rgutenk)

unread,
Mar 11, 2015, 3:33:16 PM3/11/15
to dadi...@googlegroups.com
Hi Mark,

Assuming those are Poisson and not multinomial residuals, they look great. Exactly like I'd expect.

So I'm not sure where that factor of 2 in theta is coming from. You should be able to run the exact same script for your bootstrap spectra as you did the real data.

Best,
Ryan

<bootstrap1.pdf><bootstrap2.pdf><bootstrap3.pdf>

Mark Phuong

unread,
Mar 11, 2015, 5:03:02 PM3/11/15
to dadi...@googlegroups.com
Hi Ryan!

Thanks for pointing out that clarification -- I used the wrong plotting command. These residuals (attached) look skewed. To fix this problem, I should fix my bootstrapping sampling scheme right?

Many thanks!

Mark
bootstrap3.pdf
bootstrap1.pdf
bootstrap2.pdf

Gutenkunst, Ryan N - (rgutenk)

unread,
Mar 11, 2015, 11:29:16 PM3/11/15
to dadi...@googlegroups.com
Hi Mark,

Yes, it looks like your bootstrapping scheme is generating too few SNPs.

One potential issue you might be encountering. If you're using make_data_dict and from_data_dict functions in dadi, note that they rely on each SNP in the data having a unique ID, defined by the last columns in the data file. So if your bootstrap code is generating such a file, it needs to ensure the last columns or all unique. Otherwise the repeated entries will get dropped.

Best,
Ryan

<bootstrap3.pdf><bootstrap1.pdf><bootstrap2.pdf>

--
Ryan Gutenkunst
Assistant Professor
Molecular and Cellular Biology
University of Arizona
phone: (520) 626-0569
http://gutengroup.mcb.arizona.edu

c.monster

unread,
Apr 22, 2015, 11:49:59 PM4/22/15
to dadi...@googlegroups.com
Hello - 

Here is something that might be useful, especially for people studying non-model organisms (especially using RAD): 

The first script (vcf2dadi.pl) generates Misc.make_data_dict -readable file from a VCF v.4.1 file, such as emitted by 2bRAD denovo or GATK pipelines (I cannibalized somebody else's script but sadly cannot find the original anymore, so I apologize for not listing the original author)

The second script (dadiBoot.pl) uses the output of vcf2dadi.pl to bootstrap over entries listed in the "Gene" column. Those would be genomic contigs or scaffolds. The script makes sure that the same locus and coordinate does not repeat, so it seems to work just fine in terms of theta.

I am personally very happy with the nonparametric bootstrap results for my own 2bRAD data. Since my dadi models started accounting for a possibility of ancestral state misidentification (as per Ryan's suggestion in another thread  "Excess of low-frequency variants") they fit really well and the results are quite consistent across bootstrap replicates (including theta).

cheers

Mikhail
vcf2dadi.pl
dadiBoot.pl

Gutenkunst, Ryan N - (rgutenk)

unread,
Apr 23, 2015, 12:14:50 PM4/23/15
to dadi...@googlegroups.com
Hi Mikhail,

Thanks for posting these! I think others will find them very useful.

Best,
Ryan

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To post to this group, send email to dadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/dadi-user.
For more options, visit https://groups.google.com/d/optout.
<vcf2dadi.pl><dadiBoot.pl>

--
Ryan Gutenkunst
Assistant Professor
Molecular and Cellular Biology
University of Arizona

Mark Phuong

unread,
Jun 3, 2015, 3:43:40 AM6/3/15
to dadi...@googlegroups.com
Hi Ryan!

Thanks for the tip. That was definitely what was causing my initial bootstrapping problems. 

I have a follow up question about interpretation. I'm primarily interested in migration rate estimates from a 3 pop divergence model, and it seems that from the observed paramter estimates, M23 is about 10X higher than M13 or M12. I did the bootstrapping, and obtained the following confidence intervals:

M12 360 [0 - 588]
M13 367 [0 - 595]
M23 7561 [0 - 10859]

following the calculation of the MEAN +- 1.96*STDEV. I am told this means that there is a signal that M23 is higher, but the wide confidence intervals suggest that I can't say that with certainty. Is that all I can say about these values? For every bootstrap set, the migration rate values seem to change together, in a sense that the ratio between the migration rates always show M23 being 10-20X higher than the other two rates. Does that mean anything? I've attached the parameter estimates for the bootstrap datasets. All my other parameters seem to follow a normal distribution and have pretty tight confidence intervals (N3 pictured as an example), whereas my migration rate estimates seem to jump all over the place and are somewhat left-centered.

Many thanks for your advice!

Mark
dadi_bootstrap.pdf

Gutenkunst, Ryan N - (rgutenk)

unread,
Jun 3, 2015, 3:54:43 PM6/3/15
to dadi...@googlegroups.com
Hi Mark,

I often find it helpful to plot the joint distributions of the bootstrap parameters. For example, a scatter plot of M23 vs M12, where each point is a bootstrap result. That will reveal correlations between your inferred parameters. You can make a statement like "In 99/100 bootstraps M23 > M12, indicating that migration between populations 2 and 3 is higher than that between 1 and 2"

Best,
Ryan

<dadi_bootstrap.pdf>

--
Ryan Gutenkunst
Assistant Professor
Molecular and Cellular Biology
University of Arizona
phone: (520) 626-0569, office LSS 325

Mark Phuong

unread,
Jun 3, 2015, 7:22:04 PM6/3/15
to dadi...@googlegroups.com
Sweet. Thanks for the tip Ryan!
Reply all
Reply to author
Forward
0 new messages