Connecting Individuals in pyslim recapitation to SLiM metadata

8 views
Skip to first unread message

Ellie Weise

unread,
Mar 12, 2026, 3:52:39 PM (5 days ago) Mar 12
to slim-discuss
Hi everyone,

I have what I think is a housekeeping issue connecting my individuals generated in SLiM to (eventual) genotypes generated using recapitation and mutation functions in pyslim. I will say right off the bat that my python skills are pretty poor, but I'm wondering if I missing something in the initializeTree or OutputTreeSeq calls. I haven't changed much from the defaults in the recipes for recapitation at this point since it's pretty new to me :)

So I have a SLiM script that creates a large (K = 10,000 fish), single population of long-lived "Halibut" for 200 years, with population growth between year 120 and 200. I have sampling from years 195 and 200, and they are tracked in an output file which looks like this: 

ID    syears_pre  pop_pre   lenquan     length      SLiMID

M191_0_21634                  194   0                 0.554196    43.8083     21633

F190_0_21101                  194   0                 0.321133    54.3978     21100

M190_0_21234                  194   0                 0.740409    65.4842     21233

M190_0_20931                  194   0                 0.308491    49.7787     20930

F190_0_21110                  194   0                 0.792312    72.163      21109

F190_0_21211                  194   0                 0.0735019   40.7073     21210

M190_0_21397                  194   0                 0.852234    70.9921     21396


I figured that the tree sequence would retain those individual IDs from the simulation so I could connect the metadata generated with the genotypes from the recapitation script, but the output of the .vcf seems to just have individual numbers ind1-ind[population size of year 200].  I am wondering if there's a way to just get genotypes for my sampled individuals from their respective years, and how to retain their individual numbers?

I have pasted what I think are the relevant python code section below (it is basically the tutorial at this point though), let me know if you need any other information or code sections and I will add those as well :) thank you so much in advance for your help!

All the best,
Ellie
------
slim_ts = tskit.load("D:/SLiMOutputs/decap.trees/chromosome_1.trees")

#NTS# recapitation takes several hours/chromosome for this pop size
recap_ts = pyslim.recapitate(slim_ts, recombination_rate=1e-5, ancestral_Ne=1000)

next_id = pyslim.next_slim_mutation_id(recap_ts)
ts = msprime.sim_mutations(
           recap_ts,
           rate=1e-8,
           model=msprime.SLiMMutationModel(type=0, next_id=next_id),
           keep=True,
)

nts = pyslim.generate_nucleotides(ts)
nts = pyslim.convert_alleles(nts)
with open("D:/SLiMOutputs/decap.trees/RecapTest_Chr1_python.vcf", "w") as vcffile:
    nts.write_vcf(vcffile)






Ben Haller

unread,
Mar 12, 2026, 6:11:41 PM (4 days ago) Mar 12
to slim-d...@googlegroups.com
Hi Ellie!

I'm a little confused because you refer to tree-sequence recording but also VCF.  Are you using both?  It would be helpful to get complete (but minimal for showing the question) SLiM and python scripts, if this turns out not to be an easy answer.


You write:

> I figured that the tree sequence would retain those individual IDs from the simulation so I could connect the metadata generated with the genotypes from the recapitation script

Yes, for individuals that are retained in the tree sequence or are present in the individual table for other reasons, the SLiM id of the individuals is present in the individuals metadata.  If there are individuals you need to be present in the tree sequence that aren't, you can call treeSeqRememberIndividuals() to ensure that those individuals are kept.  See the pyslim doc for lots of guidance on accessing metadata for things in the tree sequence, I think.


But then you also write:

> but the output of the .vcf seems to just have individual numbers ind1-ind[population size of year 200]

Yes, SLiM IDs don't get written into VCF output by SLiM.  In some cases they could be, potentially, but in other cases they couldn't really be, since the "individuals" in the SLiM VCF output don't necessarily correspond to actual individuals (since you can just write a vector of haplosomes to VCF, too).  So if you are, in fact, writing a vector of individuals to VCF, you can do a parallel write of all their individual SLiM IDs to a separate file, and then match up IDs to VCF individuals, one to one, using that file.

I hope that helps; happy modeling!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/d990cd41-65b4-42e6-bd95-1e54e8297cd2n%40googlegroups.com.

Ellie Weise

unread,
Mar 13, 2026, 9:37:08 AM (4 days ago) Mar 13
to slim-discuss
Hi Ben,

Thanks! I think that is absolutely getting at my question but I do have some follow ups :) I am not writing a VCF in SLiM, just the tree sequence as an output, and then the VCF comes from the recapitation script in pyslim. For the SLiM side, I am wondering where to put the RememberTreeSeq line (in green in the chunk below)? I don't need all of the individuals at the end, but at least the sample would be helpful. Here is the sampling section of the code where we do the sampling, would I be able to add the RememberTreeSeq there to keep track of those individuals? My goal is to keep the individual names consistent, or at least have the ability to match later, between the SLiM metadata that we generate and then store in the sampling and pedigree outputs, and the pyslim VCF output that we generate later.

// sample 3% of possible sample population during years 195-200, inclusive

195:200 early() {

// get the total number of individuals of age 2 to 19

num_2_to_19 = sum(tabulate(p1.individuals.age, maxbin = 50)[2:19]);

nsamp = rbinom(1,num_2_to_19,0.03);

YearSamp = NULL;

for(a in 1:17) {

TmpProp = SampProp[a];

// get number for the given age

ns = rbinom(1, nsamp, TmpProp);

if(ns > 0) {

// then sample them:

AgeSamp = p1.sampleIndividuals(ns, minAge = a+1, maxAge = a+1);

YearSamp = c(YearSamp,AgeSamp);

}

}

//YearSamp.setValue("extant_tags",0.0);

YearSamp.fitnessScaling = 0.0;

YearSamp.RememberTreeSeq();

// then write these into the output

for(s in YearSamp) {

s_name = paste0(s.sex, sim_year(sim.cycle) - s.age, "_0_", s.tag);

line = paste(s_name, "", "", sim_year(sim.cycle), "0", "", "",s.x,s.y, s.pedigreeID,sep = "\t");

writeFile("spip_samples.tsv", line, append = T);

}

}


The other place where I was thinking there might be issues is in the initialization, I attached that code below. I made the treeseq related lines green because there's several constants defined for reproduction/growth that I don't think are relevant except to give a sense of what we're tracking for the individuals? 


initialize() {

initializeSLiMModelType("nonWF");

initializeSLiMOptions(keepPedigrees=T);

initializeTreeSeq(simplificationRatio=INF,timeUnit="generations");

setwd("C:/Users/danielruzzante/Documents/SLiMOutputs/K10000_80000");


//define carrying capacities for original (K1) and growth period (K)

defineConstant("K", 80000);

defineConstant("K1", 10000);


// "species 1" mortality rates for ages 0 to 50. (NOTE: All 50-year-olds die)

defineConstant("L", c(0.0, 0.13, 0.13, 0.13, 0.13, 0.15, 0.18, 0.21, 0.21, 0.22, 0.22, 0.21, 0.21, 0.21, 0.20, 0.20, 0.20, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 1.00));

// male fecundity rates for ages 0 to 50.

defineConstant("FecM", c(0.000, 0.000, 0.002, 0.012, 0.044, 0.107, 0.189, 0.275, 0.355, 0.428, 0.492, 0.548, 0.595, 0.637, 0.674, 0.710, 0.745, 0.782, 0.823, 0.868, 0.918, 0.970, 1.024, 1.079, 1.134, 1.190, 1.247, 1.305, 1.364, 1.424, 1.486, 1.550, 1.616, 1.684, 1.754, 1.826, 1.900, 1.976, 2.054, 2.135, 2.218, 2.303, 2.391, 2.481, 2.573, 2.668, 2.765, 2.865, 2.968, 3.073));

// female fecundity rates for ages 0 to 50 - divided by 4 to decrease total offspring number

defineConstant("FecF", c(0.00000, 0.00000, 0.00000, 0.00025, 0.00100, 0.00425, 0.01575, 0.04525, 0.09825, 0.16225, 0.22275, 0.27600, 0.32450, 0.37075, 0.41675, 0.46350, 0.51125, 0.55950, 0.60800, 0.65650, 0.70400, 0.75000, 0.79375, 0.83525, 0.87450, 0.91225, 0.94950, 0.98650, 1.02375, 1.06125, 1.09925, 1.13850, 1.17825, 1.21925, 1.26150, 1.30450, 1.34850, 1.39375, 1.44000, 1.48750, 1.53600, 1.58550, 1.63625, 1.68825, 1.74150, 1.79575, 1.85125, 1.90800, 1.96600, 2.02550));

//The female growth length means for 0 to 50

defineConstant("lengthM",c(1, 8.25, 25.84, 42.31, 56.64, 68.99, 79.55, 88.47, 95.93, 102.08, 107.11, 111.17, 114.44, 117.09, 119.27, 121.16, 122.89, 124.55, 126.22, 128, 129.92, 131.92, 133.96, 135.97, 137.94, 139.85, 141.72, 143.56, 145.36, 147.15, 148.92, 150.68, 152.44, 154.2, 155.96, 157.72, 159.48, 161.24, 163, 164.76, 166.52, 168.28, 170.04, 171.8, 173.56, 175.32, 177.08, 178.84, 180.6, 182.36, 184.12));

//The female growth length sds

defineConstant("lengthSDM",c(0.31, 2.48, 7.23, 10.99, 13.72, 15.68, 17.05, 18, 18.65, 19.08, 19.37, 19.55, 19.67, 19.75, 19.8, 19.83, 19.86, 19.87, 19.88, 19.89, 19.88, 19.87, 19.84, 19.8, 19.76, 19.71, 19.65, 19.58, 19.51, 19.43, 19.34, 19.25, 19.15, 19.04, 18.92, 18.79, 18.66, 18.52, 18.38, 18.22, 18.06, 17.89, 17.71, 17.53, 17.33, 17.13, 16.93, 16.71, 16.49, 16.26, 16.02));

//The male growth length means for 0 to 50

defineConstant("lengthF",c(1, 6.31, 25.84, 44.33, 60.85, 75.54, 88.55, 100.04, 110.17, 119.08, 126.93, 133.87, 140.05, 145.64, 150.77, 155.6, 160.2, 164.58, 168.72, 172.63, 176.31, 179.74, 182.91, 185.8, 188.42, 190.83, 193.08, 195.21, 197.28, 199.31, 201.3, 203.27, 205.23, 207.2, 209.16, 211.13, 213.09, 215.06, 217.02, 218.99, 220.95, 222.92, 224.88, 226.85, 228.81, 230.78, 232.74, 234.71, 236.67, 238.64, 240.6));

//The male growth length sds

defineConstant("lengthSDF",c(0.29, 1.8, 6.84, 10.88, 13.89, 16.09, 17.67, 18.76, 19.5, 19.98, 20.26, 20.4, 20.44, 20.41, 20.32, 20.19, 20.02, 19.82, 19.59, 19.35, 19.08, 18.81, 18.54, 18.28, 18.02, 17.77, 17.53, 17.29, 17.05, 16.8, 16.56, 16.3, 16.04, 15.77, 15.49, 15.21, 14.91, 14.61, 14.3, 13.98, 13.66, 13.32, 12.98, 12.63, 12.27, 11.9, 11.53, 11.15, 10.76, 10.36, 9.95));


// constant for sampling ages 2-19 - proportion

defineConstant("SampProp", c(0.003, 0.003, 0.008, 0.027, 0.041, 0.099, 0.109, 0.134, 0.152, 0.145, 0.125, 0.076, 0.034, 0.029, 0.008, 0.002, 0.002, 0.002));


//initialize genome section

initializeSex();

initializeMutationType("m1", 0.5, "f", 0.0);

initializeGenomicElementType("g1", m1, 1.0);


// make halibut-type genome assembly

ids = 1:24;

lengths = c(30597443, 11297635, 32675587, 28512877, 30753830,

28531750, 28364880, 23524515, 29170869, 27686247,

22214583, 21256114, 29083852, 24953334, 25621292,

27148517, 24735652, 16627714, 20912647, 20800843,

25055940, 26312018, 18431247, 21435488);

for (id in ids, length in lengths)

{

initializeChromosome(id, length, "A");

initializeMutationRate(0);

initializeRecombinationRate(1e-5);

initializeGenomicElement(g1);

}

// delete any files that will be appended to and write the headers on them that CKMRpop expects

writeFile("spip_pedigree.tsv", "year\tpop\tkid\tpa\tma\tlenquan\tlength\tSLiMID");

writeFile("spip_prekill_census.tsv", "year\tpop\tage\tmale\tfemale");

writeFile("spip_postkill_census.tsv", "year\tpop\tage\tmale\tfemale");

writeFile("spip_samples.tsv", "ID\tsyears_pre\tpop_pre\tsyears_post\tpop_post\tsyears_dur\tpop_dur\tlenquan\tlength\tSLiMID");

//File for everyone who is alive in a year and their length in that year :)

writeFile("slim_alives.tsv","year\tpop\tid\tlenquan\tlength\tsex\tage");

}



Ben Haller

unread,
Mar 13, 2026, 12:21:01 PM (4 days ago) Mar 13
to slim-d...@googlegroups.com
Hi Ellie!

Ah, so if you're writing a VCF file from Python you're using the tskit VCF-writing facilities, which have nothing to do with SLiM.  You'll want to check out the tskit documentation and ask those folks if you have questions about that.  It won't write out SLiM individual IDs to the VCF, I think, but you should be able to get the SLiM individual IDs from the tree sequence metadata and write those IDs out to a separate file, parallel to the VCF file, that provides the identity of the individuals written to the VCF through a one-to-one correspondence between the VCF and the list of IDs, assuming tskit's VCF writing allows you to specify the exact order in which the sampled individuals should be written.

Yes, you can call treeSeqRememberIndividuals() on the individuals that you sample to ensure that they are kept in the tree sequence (so that you have a SLiM ID in the metadata for all of them).  That should work fine.  The SLiM individual IDs will then be the "key" that allows you to cross-index all of your output information across different places.

I hope that clarifies things; happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Reply all
Reply to author
Forward
0 new messages