Hi all, thanks to Ryan et al. for such active support here.
I used the Portik et al. 3-population optimization pipeline to find the best-fitting model and corresponding parameters for a project I'm working on with ddRAD data. I've drawn 1 SNP per RAD locus and am projecting down to maximize segregating sites. As expected for a 3-population model, optimization took a long time!
Now I'm trying to tackle uncertainty analysis and am finding that it also takes a very long time -- no bootstrap has finished yet (running 48-hour jobs).
My main questions are:
1) do you notice anything potentially incorrect in the code below? I've written it to pull just one bootstrap, but obviously I'd prefer to pull, e.g., 100.
2) is it possible to parallelize what the GIM_uncert function is doing (e.g., running it on one bootstrap per job across 100 jobs)?
3) if this just reflects that GIM uncertainty calculation for this model + dataset really is computationally infeasible, do y'all have any recommendations for how best to interpret results from a Portik-et-al.-pipeline-style optimization alone?
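In case it helps frame question 2: my (possibly wrong) understanding of the Godambe construction is GIM = H * inv(J) * H, where H is the Hessian of the data log-likelihood and J is the uncentered average of outer products of per-bootstrap score vectors. If the score vectors could be computed one per job (dadi doesn't expose this publicly, so this is a hypothetical sketch of only the cheap combining step, not a supported workflow):

```python
import numpy as np

def gim_uncerts(H, scores):
    """Combine per-bootstrap score vectors into GIM standard errors.

    H: (p, p) Hessian of the data log-likelihood at the best-fit params.
    scores: (n_boot, p) array, one score (gradient) vector per bootstrap.
    Assumes GIM = H @ inv(J) @ H with J the uncentered average of
    outer products of the score vectors.
    """
    U = np.asarray(scores, dtype=float)
    J = U.T @ U / len(U)                          # average outer product
    GIM = H @ np.linalg.inv(J) @ H                # Godambe information matrix
    return np.sqrt(np.diag(np.linalg.inv(GIM)))  # standard errors
```

If something like this is sound, the 100 score-vector jobs would be embarrassingly parallel and this final step takes milliseconds.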
I've pasted in my python code below.
Thanks so much!
Patrick
import dadi
import vcf  # PyVCF; used below to list sample names
model_name = 'split_symmig_all'
species_ordered = ['a', 'b', 'c']
proj = [32, 26, 100]  # three projection values
pts = [140, 180, 220]  # three extrapolation grid sizes
vcf_path='/path/to/data.vcf'
# Open the VCF file
vcf_reader = vcf.Reader(filename=vcf_path)
# Get the sample names
sample_names = vcf_reader.samples
# read in the population mapping file
ind2pop_path = './ind2pop_dadi.txt'
dd = dadi.Misc.make_data_dict_vcf(vcf_path, ind2pop_path)
fs = dadi.Spectrum.from_data_dict(dd, species_ordered, projections = proj, polarized = False)
fs.S()
# 7767.232605334261
## Bootstraps
chunk_size = 10000000  # 10 Mb chunks
chunks = dadi.Misc.fragment_data_dict(dd, chunk_size)
len(chunks)
# 1474
bs = dadi.Misc.bootstraps_from_dd_chunks(chunks,1,species_ordered,proj)
bs[0].S()
# 8006.640899188973
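For reference, here's how I'd plan to pull and cache 100 bootstraps instead of 1 (the commented dadi calls mirror the ones above; `boot_path` and the `bootstraps/` directory are just my naming choices):

```python
import os

def boot_path(boot_dir, i):
    # zero-padded names keep up to 1000 bootstrap spectra sorted on disk
    return os.path.join(boot_dir, "boot_{:03d}.fs".format(i))

# Driver (needs dadi plus the `chunks`, `species_ordered`, `proj` objects above):
# os.makedirs("bootstraps", exist_ok=True)
# all_boot = dadi.Misc.bootstraps_from_dd_chunks(chunks, 100, species_ordered, proj)
# for i, b in enumerate(all_boot):
#     b.to_file(boot_path("bootstraps", i))
# ...and in a later job:
# all_boot = [dadi.Spectrum.from_file(boot_path("bootstraps", i)) for i in range(100)]
```

Caching the spectra to disk would let separate uncertainty jobs reload them without re-fragmenting the data dict each time.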
## Define model function ##
###################
def split_symmig_all(params, ns, pts):
    """
    Model with split between pop 1 and (2,3), then split between 2 and 3.
    Migration is symmetrical between all population pairs (i.e. 1<->2, 2<->3, and 1<->3).
    nu1: Size of population 1 after split.
    nuA: Size of population (2,3) after split from 1.
    nu2: Size of population 2 after split.
    nu3: Size of population 3 after split.
    mA: Migration rate between population 1 and population (2,3)
    m1: Migration rate between populations 1 and 2 (2*Na*m)
    m2: Migration rate between populations 2 and 3
    m3: Migration rate between populations 1 and 3
    T1: Scaled time between the split of pop 1 from (2,3) and the split of pops 2 and 3 (in units of 2*Na generations).
    T2: Scaled time between the split of pops 2 and 3 and the present (in units of 2*Na generations).
    """
    # 10 parameters
    nu1, nuA, nu2, nu3, mA, m1, m2, m3, T1, T2 = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
    phi = dadi.Integration.two_pops(phi, xx, T1, nu1=nu1, nu2=nuA, m12=mA, m21=mA)
    phi = dadi.PhiManip.phi_2D_to_3D_split_2(xx, phi)
    phi = dadi.Integration.three_pops(phi, xx, T2, nu1=nu1, nu2=nu2, nu3=nu3,
                                      m12=m1, m21=m1, m23=m2, m32=m2, m13=m3, m31=m3)
    fs = dadi.Spectrum.from_phi(phi, ns, (xx, xx, xx))
    return fs
###################
# Best-fit parameters from optimization, in the order
# nu1, nuA, nu2, nu3, mA, m1, m2, m3, T1, T2
p0 = [0.5292, 2.6191, 0.8672, 1.455, 4.5324, 0.116, 0.0874, 0.0579, 0.1295, 1.3796]
# Make extrap function
split_symmig_all_ex = dadi.Numerics.make_extrap_func(split_symmig_all)
# This is what doesn't finish.
uncert = dadi.Godambe.GIM_uncert(func_ex=split_symmig_all_ex, grid_pts=pts, all_boot=bs, p0=p0, data=fs)
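If/when GIM_uncert does finish, my plan (assuming I'm reading the docs right that it returns estimated standard deviations, with an extra trailing entry for theta when multinom=True) is to turn the output into Wald-style 95% intervals. Placeholder numbers below, not real results:

```python
import numpy as np

def wald_cis(p0, ses, z=1.96):
    """Elementwise Wald intervals: p0 +/- z * SE."""
    p0 = np.asarray(p0, dtype=float)
    ses = np.asarray(ses, dtype=float)
    return np.column_stack((p0 - z * ses, p0 + z * ses))

# Placeholder standard errors; the real ones would be uncert[:-1],
# dropping the trailing theta entry before pairing with the 10 params.
cis = wald_cis([0.5292, 2.6191], [0.05, 0.30])
```

(Any correction if I've misread what GIM_uncert returns would be much appreciated.)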