Extracting ancestral tracts from admixed population in two population admixture model by slim and tskit

37 views
Skip to first unread message

yang lu

unread,
May 11, 2025, 5:13:30 AMMay 11
to slim-discuss
Extracting ancestral tracts from admixed population in two population admixture model by slim and tskit
 
Hello everyone! I'm encountering an issue with ancestral tract length analysis from tree sequences simulated using SLiM. While both my SLiM model and Python analysis pipeline run without errors, the extracted average ancestral tract lengths for both populations show significant discrepancies from theoretical expectations. For instance, under an admixture event occurring G = 200 generations ago with proportions m₁ = 0.4 and m₂ = 0.6, the expected average ancestral tract lengths should follow L = 1 / [G × (1 − mᵢ)]. However, the values simulated from my Python pipeline deviate significantly from this theoretical expectation.

And my code is written below:

SLiM:

initialize() {
    setSeed(SEED);
    initializeTreeSeq();
    initializeMutationRate(MU);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeMutationType("m2", 0.5, "f", 2*SEL1);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, asInteger(L));
    initializeRecombinationRate(R);
    //defineConstant("G",GEN);
    defineConstant("tree_seq_path",TREE_TMP_PATH+"tmp_tree_"+SEED+".trees");
    defineConstant("vcf_path_ref",VCF_PATH+"tmp_vcf_ref_"+SEED+".vcf");
    defineConstant("vcf_path_adm",VCF_PATH+"tmp_vcf_adm_"+SEED+".vcf");
    //defineConstant("Burn-start",10*Ne+1)
    }
1 early() {
    cat("start simulation\n");
    sim.addSubpop("p1", Ne);
    sim.addSubpop("p2", Ne);
    }
1: late(){
catn(sim.cycle);
}
10*Ne late(){
    sim.treeSeqOutput(tree_seq_path);
    target=sample(p1.genomes, 1);
    target.addNewDrawnMutation(m2, asInteger(L/2));
}
10*Ne: late(){
    muts = sim.mutationsOfType(m2);
    counts = p1.genomes.countOfMutationsOfType(m2);
    freq = mean(counts > 0);
    if (size(muts) !=1)
    {
        cat("RESTART    "+sim.cycle+"\n");
        catn("-------------");
        sim.readFromPopulationFile(tree_seq_path);
        setSeed(getSeed() + 1);
        target = sample(p1.genomes, 1);
        target.addNewDrawnMutation(m2, asInteger(L/2));}
    else{
        if (freq>FREQ_LB)
        {   times=sim.cycle+1;
            community.rescheduleScriptBlock(s1, start=times, end=times);
        community.rescheduleScriptBlock(s2, start=times, end=times);
            community.rescheduleScriptBlock(s3, start=times+GEN, end=times+GEN);
            //community.rescheduleScriptBlock(s4, start=times+GEN, end=times+GEN);
            community.deregisterScriptBlock(self);
        }
    }
}

s1 10000000 early(){
    sim.addSubpop("p3", Ne);
    catn("Start admixture at "+sim.cycle);
    p3.setMigrationRates(c(p1, p2),c(MIG_PROP, 1-MIG_PROP));

}
s2 100000000 late(){
    m2muts = sim.mutationsOfType(m2);
    m2muts.setSelectionCoeff(2*SEL2);
    p3.setMigrationRates(c(p1, p2),c(0,0));
}
s3 1000000000 late(){
    catn(tree_seq_path);
    c(p1.genomes[0:(NUMSIMU-1)],p2.genomes[0:(NUMSIMU-1)]).outputVCF(vcf_path_ref,outputMultiallelics=F);
    //c(p3.genomes[0:NUMSIMU]).outputVCF(vcf_path_adm,outputMultiallelics=F);
    p1.setSubpopulationSize(0);
    p2.setSubpopulationSize(0);
    //p3.setSubpopulationSize(NUMSIMU);
    sim.treeSeqOutput(tree_seq_path);
    //sim.simulationFinished();
cat("end simulation\n");
}

I set a very weak selection to approximate the neutral condition
 
Python:

def find_anc_node(tree,node_id):
## find the "root" node (may not fully coelescent when the sample are from different ancestry)
    current_node = node_id
    while current_node != tskit.NULL:
        parent = tree.parent(current_node)
        if parent == -1:  
            break
        current_node = parent
    return current_node

seeder=123
rec_rate=1e-8
Ne=500
num_simu=100    
num_simu_ref=int(2*num_simu)             #the sample number of reference hap
num_simu_adm=num_simu                  #the sample number of reference diploids
freq_lb=0.995
mig_prop=0.4
mu=1e-8
sel1=0.01
sel2=0.0000000001   # very weak selection nearly neutral
L=1e8
Gen=200
vcf_save_path_prefix="/data/yanglu/0503compare_methods/vcfpath/"
tree_tmp_path_prefix="/data/yanglu/0503compare_methods/treetmp/"
slim_path="/home/luyang/0503compare_methods/slim_2pop.slim"

##SLiM simulation
slim_commands=f"slim -d L={int(L)} -d SEED={seeder} -d NUMSIMU={num_simu_ref} -d FREQ_LB={freq_lb} -d MIG_PROP={mig_prop} -d MU={mu} -d SEL1={sel1} -d SEL2={sel2} -d Ne={Ne} -d R={rec_rate} -d GEN={Gen} -d \"VCF_PATH='{vcf_save_path_prefix}'\" -d \"TREE_TMP_PATH='{tree_tmp_path_prefix}'\" {slim_path}"
# subprocess.run(slim_commands, shell=True, capture_output=True,text=True)
subprocess.run(slim_commands, shell=True)

## Extract the subtree of admixed population of sample
tree_total_path=f"/data/yanglu/0503compare_methods/treetmp/tmp_tree_{seeder}.trees"
ts = tskit.load(tree_total_path)
adm_vcf_path=f"/data/yanglu/0503compare_methods/vcfpath/tmp_vcf_adm_{seeder}.vcf"
sampled_indivs = random.sample(range(ts.num_individuals), num_simu_adm)
sampled_nodes = []
for ind in sampled_indivs:
    sampled_nodes.extend(ts.individual(ind).nodes)
# print(sampled_indivs) #[5, 16, 12, 43]
# print(sampled_nodes)  #[10, 11, 32, 33, 24, 25, 86, 87]
ancts=ts.simplify(sampled_nodes,filter_populations=False,keep_input_roots=True)

## The treeinfo table records the ancestral origins of different genomic tracts along admixed population chromosomes."
tree_res=[]
for node in ancts.samples():
    for tree in ancts.trees():
        interval=[tree.interval.left,tree.interval.right]
        anc_node=find_anc_node(tree,node)
        anc=tree.population(anc_node)
        # print(tree.index,node,anc,tree.interval.left,tree.interval.right)
        tree_res.append([
            tree.index,      
            node,            
            anc,              
            tree.interval.left,
            tree.interval.right
        ])
treeinfo = pd.DataFrame(tree_res, columns=["tree_id", "sam_node", "anc", "inter_left", "inter_right"])
treeinfo["inter_left"]=treeinfo["inter_left"]*rec_rate
treeinfo["inter_right"]=treeinfo["inter_right"]*rec_rate
treeinfo["length"]=treeinfo["inter_right"]-treeinfo["inter_left"]
treeinfo.to_csv("/home/luyang/0503compare_methods/treeinfo.csv",index=False,sep="\t")
treeinfo.drop(["length"], axis=1, inplace=True)

## Merge the adajacent interval with same ancestry
merge_anc=pd.DataFrame([])
for hap in treeinfo["sam_node"].unique():
    df=treeinfo.loc[treeinfo["sam_node"]==hap,].reset_index(drop=True)
    df['group'] = (df['anc'] != df['anc'].shift()).cumsum()
    result = df.groupby(['anc', 'group']).agg({'sam_node': 'first','inter_left': 'first','inter_right': 'last',"anc":"first"}).reset_index(drop=True)
    merge_anc=pd.concat([merge_anc,result],axis=0)
merge_anc=merge_anc.reset_index(drop=True)
# merge_anc = pd.DataFrame(merge_anc, columns=["hap", "inter_left", "inter_right", "target_anc"])
merge_anc["length"]=merge_anc["inter_right"]-merge_anc["inter_left"]

##calculate the average length ancestral tracts
print(merge_anc)
print(merge_anc.groupby(["anc"]).mean())## quite different from the theoratical results



Could there be any issues with my approach to extracting ancestral tracts? Thanks in advance

Peter Ralph

unread,
May 12, 2025, 9:26:35 AMMay 12
to yang lu, slim-discuss
Hi, Yang - let's see.

First, some general comments. It'd be nice to have a FAQ or something  for this sort of "simulation doesn't match my expectations" question, similar to Ben's "Ten simple rules for reporting a bug" (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010540), because they are very tricky to answer. The reason they are tricky is because "theory" can mean a lot of different things, and for that matter, so can "match". So, this sort of question is much more likely to get a useful and interesting answer if it includes:
  1. explicit description of what the theoretical result is (you've got this; "average ancestral tract lengths should follow L = 1 / [G × (1 − mᵢ)]")
  2. a plot of the observed and expected values as they change with some parameter
  3. citation of somewhere that explains/derives the result
  4. ideally, an overview of what assumptions and approximations underly the result (e.g., "under the WF model for large N with infinite sites mutations")
  5. a high-level description of the SLiM model and how the relevant statistics are calculated
  6. the SLiM recipe and other code (you've got this), that just runs one someone else's computer (yours doesn't), made as simple as possible
This is not directed specifically at you - these sorts of questions are pretty common, and natural, and it's not obvious to many people how a question like this can be hard without additional information.

That said, here's my guess for what the 1/[G (1-m_i)] result is from: blocks inherited from G generations are on average 1/G long; and the number of adjacent blocks from source i is Geometric with mean (1/m_i).  This makes very few assumptions, but does assume an infinite chromosome; I don't know if that is likely to be relevant at G=200. To diagnose this it'd be helpful to see how things compare at different values of G or L.

It's hard for me to tell exactly what's going on in the code (mostly because of the stuff about selection; you should remove that and test), but it looks to me like there's two possible issues:
  1. you're assigning 'population origin' using the root instead of the ancestor at time G; this looks correct because there's no migration between p1 and p2 before G, I think, but might be worth verifying
  2. wait actually the admixed population is created de novo at time G, isn't it? (It's hard to tell because of the reschedule script blocks; an example script posted here would be much more readable if rewritten without that.) So, actually the ancestry of genomes in p3 would come from p1, p2 *and* p3, not just p1 and p2, and the model I'm using above doesn't hold at all?

So - looks to me like you're not simulating an admixture event, but rather recent migration between three populations? I could be wrong; you need to spend time simplifying the script to get input on this.

-peter


From: slim-d...@googlegroups.com <slim-d...@googlegroups.com> on behalf of yang lu <yanglu12...@126.com>
Sent: Sunday, May 11, 2025 2:13 AM
To: slim-discuss <slim-d...@googlegroups.com>
Subject: Extracting ancestral tracts from admixed population in two population admixture model by slim and tskit
 
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/c1de075b-db40-4b81-91d6-216b52f38770n%40googlegroups.com.

yang lu

unread,
May 13, 2025, 8:51:12 AMMay 13
to slim-discuss

Dear Peter,

I sincerely apologize for any confusion caused by my initial question. As this is my first time posting on the forum, my explanation may have been somewhat disorganized, and the simplified code I provided likely lacked necessary descriptions. I truly appreciate you sharing "Ten Simple Rules for Reporting a Bug" - I will carefully study this resource (thank you very much for the recommendation).

I will revisit this issue with my colleagues to further clarify the problem, and then simplify both the explanation and the code accordingly.I will further organize and refine my thoughts on this matter, and we can continue the discussion afterward.

Once again, I'm very sorry for the inconvenience, and I truly appreciate you taking the time to understand my question and provide such valuable guidance.

Best regards,

Peter Ralph

unread,
May 13, 2025, 11:42:02 AMMay 13
to yang lu, slim-discuss
No worries, Yang! Best of luck.

Reply all
Reply to author
Forward
0 new messages