Hi all, thanks to Ryan et al. for such active support here.
I used the Portik et al. 3-population optimization pipeline to find the best-fitting model and corresponding parameters for a project I'm working on with ddRAD data. I've drawn 1 SNP per RAD locus and am projecting down to maximize segregating sites. As expected for a 3-population model, optimization took a long time!
Now I'm trying to tackle uncertainty analysis and am finding that it also takes a very long time -- no bootstrap has finished yet (running 48-hour jobs).
My main questions are:
1) do you notice anything potentially incorrect in the code below? I've written it to pull just one bootstrap, but obviously I'd prefer to pull, e.g., 100.
2) is it possible to parallelize what the GIM_uncert function is doing (e.g., running it on one bootstrap per job across 100 jobs)?
3) if this just reflects that GIM uncertainty calculation for this model + dataset really is computationally infeasible, do y'all have any recommendations for how best to interpret results from a Portik-et-al.-pipeline-style optimization alone?
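In case it helps frame question 2: my (possibly wrong) understanding of the Godambe construction is GIM = H * inv(J) * H, where H is the Hessian of the data log-likelihood and J is the uncentered average of outer products of per-bootstrap score vectors. If the score vectors could be computed one per job (dadi doesn't expose this publicly, so this is a hypothetical sketch of only the cheap combining step, not a supported workflow):

```python
import numpy as np

def gim_uncerts(H, scores):
    """Combine per-bootstrap score vectors into GIM standard errors.

    H: (p, p) Hessian of the data log-likelihood at the best-fit params.
    scores: (n_boot, p) array, one score (gradient) vector per bootstrap.
    Assumes GIM = H @ inv(J) @ H with J the uncentered average of
    outer products of the score vectors.
    """
    U = np.asarray(scores, dtype=float)
    J = U.T @ U / len(U)                          # average outer product
    GIM = H @ np.linalg.inv(J) @ H                # Godambe information matrix
    return np.sqrt(np.diag(np.linalg.inv(GIM)))  # standard errors
```

If something like this is sound, the 100 score-vector jobs would be embarrassingly parallel and this final step takes milliseconds.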
I've pasted in my python code below.
Thanks so much!
Patrick
import dadi
import vcf  # PyVCF; used below to list sample names
model_name = 'split_symmig_all'
species_ordered = ['a', 'b', 'c']
proj = [32, 26, 100]  # three projection values
pts = [140, 180, 220]  # three extrapolation grid sizes
vcf_path='/path/to/data.vcf'
# Open the VCF file
vcf_reader = vcf.Reader(filename=vcf_path)
# Get the sample names
sample_names = vcf_reader.samples
# read in the population mapping file
ind2pop_path = './ind2pop_dadi.txt'
dd = dadi.Misc.make_data_dict_vcf(vcf_path, ind2pop_path)
fs = dadi.Spectrum.from_data_dict(dd, species_ordered, projections = proj, polarized = False)
fs.S()
# 7767.232605334261
## Bootstraps
chunk_size = 10000000  # 10 Mb chunks
chunks = dadi.Misc.fragment_data_dict(dd, chunk_size)
len(chunks)
# 1474
bs = dadi.Misc.bootstraps_from_dd_chunks(chunks,1,species_ordered,proj)
bs[0].S()
# 8006.640899188973
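For reference, here's how I'd plan to pull and cache 100 bootstraps instead of 1 (the commented dadi calls mirror the ones above; `boot_path` and the `bootstraps/` directory are just my naming choices):

```python
import os

def boot_path(boot_dir, i):
    # zero-padded names keep up to 1000 bootstrap spectra sorted on disk
    return os.path.join(boot_dir, "boot_{:03d}.fs".format(i))

# Driver (needs dadi plus the `chunks`, `species_ordered`, `proj` objects above):
# os.makedirs("bootstraps", exist_ok=True)
# all_boot = dadi.Misc.bootstraps_from_dd_chunks(chunks, 100, species_ordered, proj)
# for i, b in enumerate(all_boot):
#     b.to_file(boot_path("bootstraps", i))
# ...and in a later job:
# all_boot = [dadi.Spectrum.from_file(boot_path("bootstraps", i)) for i in range(100)]
```

Caching the spectra to disk would let separate uncertainty jobs reload them without re-fragmenting the data dict each time.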
## Define model function ##
###################
def split_symmig_all(params, ns, pts):
    """
    Model with split between pop 1 and (2,3), then split between 2 and 3.
    Migration is symmetrical between all population pairs (i.e. 1<->2, 2<->3, and 1<->3).
    nu1: Size of population 1 after split.
    nuA: Size of population (2,3) after split from 1.
    nu2: Size of population 2 after split.
    nu3: Size of population 3 after split.
    mA: Migration rate between population 1 and population (2,3)
    m1: Migration rate between populations 1 and 2 (2*Na*m)
    m2: Migration rate between populations 2 and 3
    m3: Migration rate between populations 1 and 3
    T1: Scaled time between the split of pop 1 from (2,3) and the split of pops 2 and 3 (in units of 2*Na generations).
    T2: Scaled time between the split of pops 2 and 3 and the present (in units of 2*Na generations).
    """
    # 10 parameters
    nu1, nuA, nu2, nu3, mA, m1, m2, m3, T1, T2 = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
    phi = dadi.Integration.two_pops(phi, xx, T1, nu1=nu1, nu2=nuA, m12=mA, m21=mA)
    phi = dadi.PhiManip.phi_2D_to_3D_split_2(xx, phi)
    phi = dadi.Integration.three_pops(phi, xx, T2, nu1=nu1, nu2=nu2, nu3=nu3,
                                      m12=m1, m21=m1, m23=m2, m32=m2, m13=m3, m31=m3)
    fs = dadi.Spectrum.from_phi(phi, ns, (xx, xx, xx))
    return fs
###################
# Best-fit parameters from optimization, in the order
# nu1, nuA, nu2, nu3, mA, m1, m2, m3, T1, T2
p0 = [0.5292, 2.6191, 0.8672, 1.455, 4.5324, 0.116, 0.0874, 0.0579, 0.1295, 1.3796]
# Make extrap function
split_symmig_all_ex = dadi.Numerics.make_extrap_func(split_symmig_all)
# This is what doesn't finish.
uncert = dadi.Godambe.GIM_uncert(func_ex=split_symmig_all_ex, grid_pts=pts, all_boot=bs, p0=p0, data=fs)
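If/when GIM_uncert does finish, my plan (assuming I'm reading the docs right that it returns estimated standard deviations, with an extra trailing entry for theta when multinom=True) is to turn the output into Wald-style 95% intervals. Placeholder numbers below, not real results:

```python
import numpy as np

def wald_cis(p0, ses, z=1.96):
    """Elementwise Wald intervals: p0 +/- z * SE."""
    p0 = np.asarray(p0, dtype=float)
    ses = np.asarray(ses, dtype=float)
    return np.column_stack((p0 - z * ses, p0 + z * ses))

# Placeholder standard errors; the real ones would be uncert[:-1],
# dropping the trailing theta entry before pairing with the 10 params.
cis = wald_cis([0.5292, 2.6191], [0.05, 0.30])
```

(Any correction if I've misread what GIM_uncert returns would be much appreciated.)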