Connecting Individuals in pyslim recapitation to SLiM metadata

55 views
Skip to first unread message

Ellie Weise

unread,
Mar 12, 2026, 3:52:39 PMMar 12
to slim-discuss
Hi everyone,

I have what I think is a housekeeping issue connecting my individuals generated in SLiM to (eventual) genotypes generated using recapitation and mutation functions in pyslim. I will say right off the bat that my python skills are pretty poor, but I'm wondering if I missing something in the initializeTree or OutputTreeSeq calls. I haven't changed much from the defaults in the recipes for recapitation at this point since it's pretty new to me :)

So I have a SLiM script that creates a large (K = 10,000 fish), single population of long-lived "Halibut" for 200 years, with population growth between year 120 and 200. I have sampling from years 195 and 200, and they are tracked in an output file which looks like this: 

ID    syears_pre  pop_pre   lenquan     length      SLiMID

M191_0_21634                  194   0                 0.554196    43.8083     21633

F190_0_21101                  194   0                 0.321133    54.3978     21100

M190_0_21234                  194   0                 0.740409    65.4842     21233

M190_0_20931                  194   0                 0.308491    49.7787     20930

F190_0_21110                  194   0                 0.792312    72.163      21109

F190_0_21211                  194   0                 0.0735019   40.7073     21210

M190_0_21397                  194   0                 0.852234    70.9921     21396


I figured that the tree sequence would retain those individual IDs from the simulation so I could connect the metadata generated with the genotypes from the recapitation script, but the output of the .vcf seems to just have individual numbers ind1-ind[population size of year 200].  I am wondering if there's a way to just get genotypes for my sampled individuals from their respective years, and how to retain their individual numbers?

I have pasted what I think are the relevant python code section below (it is basically the tutorial at this point though), let me know if you need any other information or code sections and I will add those as well :) thank you so much in advance for your help!

All the best,
Ellie
------
slim_ts = tskit.load("D:/SLiMOutputs/decap.trees/chromosome_1.trees")

#NTS# recapitation takes several hours/chromosome for this pop size
recap_ts = pyslim.recapitate(slim_ts, recombination_rate=1e-5, ancestral_Ne=1000)

next_id = pyslim.next_slim_mutation_id(recap_ts)
ts = msprime.sim_mutations(
           recap_ts,
           rate=1e-8,
           model=msprime.SLiMMutationModel(type=0, next_id=next_id),
           keep=True,
)

nts = pyslim.generate_nucleotides(ts)
nts = pyslim.convert_alleles(nts)
with open("D:/SLiMOutputs/decap.trees/RecapTest_Chr1_python.vcf", "w") as vcffile:
    nts.write_vcf(vcffile)






Ben Haller

unread,
Mar 12, 2026, 6:11:41 PMMar 12
to slim-d...@googlegroups.com
Hi Ellie!

I'm a little confused because you refer to tree-sequence recording but also VCF.  Are you using both?  It would be helpful to get complete (but minimal for showing the question) SLiM and python scripts, if this turns out not to be an easy answer.


You write:

> I figured that the tree sequence would retain those individual IDs from the simulation so I could connect the metadata generated with the genotypes from the recapitation script

Yes, for individuals that are retained in the tree sequence or are present in the individual table for other reasons, the SLiM id of the individuals is present in the individuals metadata.  If there are individuals you need to be present in the tree sequence that aren't, you can call treeSeqRememberIndividuals() to ensure that those individuals are kept.  See the pyslim doc for lots of guidance on accessing metadata for things in the tree sequence, I think.


But then you also write:

> but the output of the .vcf seems to just have individual numbers ind1-ind[population size of year 200]

Yes, SLiM IDs don't get written into VCF output by SLiM.  In some cases they could be, potentially, but in other cases they couldn't really be, since the "individuals" in the SLiM VCF output don't necessarily correspond to actual individuals (since you can just write a vector of haplosomes to VCF, too).  So if you are, in fact, writing a vector of individuals to VCF, you can do a parallel write of all their individual SLiM IDs to a separate file, and then match up IDs to VCF individuals, one to one, using that file.

I hope that helps; happy modeling!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/d990cd41-65b4-42e6-bd95-1e54e8297cd2n%40googlegroups.com.

Ellie Weise

unread,
Mar 13, 2026, 9:37:08 AMMar 13
to slim-discuss
Hi Ben,

Thanks! I think that is absolutely getting at my question but I do have some follow ups :) I am not writing a VCF in SLiM, just the tree sequence as an output, and then the VCF comes from the recapitation script in pyslim. For the SLiM side, I am wondering where to put the RememberTreeSeq line (in green in the chunk below)? I don't need all of the individuals at the end, but at least the sample would be helpful. Here is the sampling section of the code where we do the sampling, would I be able to add the RememberTreeSeq there to keep track of those individuals? My goal is to keep the individual names consistent, or at least have the ability to match later, between the SLiM metadata that we generate and then store in the sampling and pedigree outputs, and the pyslim VCF output that we generate later.

// sample 3% of possible sample population during years 195-200, inclusive

195:200 early() {

// get the total number of individuals of age 2 to 19

num_2_to_19 = sum(tabulate(p1.individuals.age, maxbin = 50)[2:19]);

nsamp = rbinom(1,num_2_to_19,0.03);

YearSamp = NULL;

for(a in 1:17) {

TmpProp = SampProp[a];

// get number for the given age

ns = rbinom(1, nsamp, TmpProp);

if(ns > 0) {

// then sample them:

AgeSamp = p1.sampleIndividuals(ns, minAge = a+1, maxAge = a+1);

YearSamp = c(YearSamp,AgeSamp);

}

}

//YearSamp.setValue("extant_tags",0.0);

YearSamp.fitnessScaling = 0.0;

YearSamp.RememberTreeSeq();

// then write these into the output

for(s in YearSamp) {

s_name = paste0(s.sex, sim_year(sim.cycle) - s.age, "_0_", s.tag);

line = paste(s_name, "", "", sim_year(sim.cycle), "0", "", "",s.x,s.y, s.pedigreeID,sep = "\t");

writeFile("spip_samples.tsv", line, append = T);

}

}


The other place where I was thinking there might be issues is in the initialization, I attached that code below. I made the treeseq related lines green because there's several constants defined for reproduction/growth that I don't think are relevant except to give a sense of what we're tracking for the individuals? 


initialize() {

initializeSLiMModelType("nonWF");

initializeSLiMOptions(keepPedigrees=T);

initializeTreeSeq(simplificationRatio=INF,timeUnit="generations");

setwd("C:/Users/danielruzzante/Documents/SLiMOutputs/K10000_80000");


//define carrying capacities for original (K1) and growth period (K)

defineConstant("K", 80000);

defineConstant("K1", 10000);


// "species 1" mortality rates for ages 0 to 50. (NOTE: All 50-year-olds die)

defineConstant("L", c(0.0, 0.13, 0.13, 0.13, 0.13, 0.15, 0.18, 0.21, 0.21, 0.22, 0.22, 0.21, 0.21, 0.21, 0.20, 0.20, 0.20, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 1.00));

// male fecundity rates for ages 0 to 50.

defineConstant("FecM", c(0.000, 0.000, 0.002, 0.012, 0.044, 0.107, 0.189, 0.275, 0.355, 0.428, 0.492, 0.548, 0.595, 0.637, 0.674, 0.710, 0.745, 0.782, 0.823, 0.868, 0.918, 0.970, 1.024, 1.079, 1.134, 1.190, 1.247, 1.305, 1.364, 1.424, 1.486, 1.550, 1.616, 1.684, 1.754, 1.826, 1.900, 1.976, 2.054, 2.135, 2.218, 2.303, 2.391, 2.481, 2.573, 2.668, 2.765, 2.865, 2.968, 3.073));

// female fecundity rates for ages 0 to 50 - divided by 4 to decrease total offspring number

defineConstant("FecF", c(0.00000, 0.00000, 0.00000, 0.00025, 0.00100, 0.00425, 0.01575, 0.04525, 0.09825, 0.16225, 0.22275, 0.27600, 0.32450, 0.37075, 0.41675, 0.46350, 0.51125, 0.55950, 0.60800, 0.65650, 0.70400, 0.75000, 0.79375, 0.83525, 0.87450, 0.91225, 0.94950, 0.98650, 1.02375, 1.06125, 1.09925, 1.13850, 1.17825, 1.21925, 1.26150, 1.30450, 1.34850, 1.39375, 1.44000, 1.48750, 1.53600, 1.58550, 1.63625, 1.68825, 1.74150, 1.79575, 1.85125, 1.90800, 1.96600, 2.02550));

//The female growth length means for 0 to 50

defineConstant("lengthM",c(1, 8.25, 25.84, 42.31, 56.64, 68.99, 79.55, 88.47, 95.93, 102.08, 107.11, 111.17, 114.44, 117.09, 119.27, 121.16, 122.89, 124.55, 126.22, 128, 129.92, 131.92, 133.96, 135.97, 137.94, 139.85, 141.72, 143.56, 145.36, 147.15, 148.92, 150.68, 152.44, 154.2, 155.96, 157.72, 159.48, 161.24, 163, 164.76, 166.52, 168.28, 170.04, 171.8, 173.56, 175.32, 177.08, 178.84, 180.6, 182.36, 184.12));

//The female growth length sds

defineConstant("lengthSDM",c(0.31, 2.48, 7.23, 10.99, 13.72, 15.68, 17.05, 18, 18.65, 19.08, 19.37, 19.55, 19.67, 19.75, 19.8, 19.83, 19.86, 19.87, 19.88, 19.89, 19.88, 19.87, 19.84, 19.8, 19.76, 19.71, 19.65, 19.58, 19.51, 19.43, 19.34, 19.25, 19.15, 19.04, 18.92, 18.79, 18.66, 18.52, 18.38, 18.22, 18.06, 17.89, 17.71, 17.53, 17.33, 17.13, 16.93, 16.71, 16.49, 16.26, 16.02));

//The male growth length means for 0 to 50

defineConstant("lengthF",c(1, 6.31, 25.84, 44.33, 60.85, 75.54, 88.55, 100.04, 110.17, 119.08, 126.93, 133.87, 140.05, 145.64, 150.77, 155.6, 160.2, 164.58, 168.72, 172.63, 176.31, 179.74, 182.91, 185.8, 188.42, 190.83, 193.08, 195.21, 197.28, 199.31, 201.3, 203.27, 205.23, 207.2, 209.16, 211.13, 213.09, 215.06, 217.02, 218.99, 220.95, 222.92, 224.88, 226.85, 228.81, 230.78, 232.74, 234.71, 236.67, 238.64, 240.6));

//The male growth length sds

defineConstant("lengthSDF",c(0.29, 1.8, 6.84, 10.88, 13.89, 16.09, 17.67, 18.76, 19.5, 19.98, 20.26, 20.4, 20.44, 20.41, 20.32, 20.19, 20.02, 19.82, 19.59, 19.35, 19.08, 18.81, 18.54, 18.28, 18.02, 17.77, 17.53, 17.29, 17.05, 16.8, 16.56, 16.3, 16.04, 15.77, 15.49, 15.21, 14.91, 14.61, 14.3, 13.98, 13.66, 13.32, 12.98, 12.63, 12.27, 11.9, 11.53, 11.15, 10.76, 10.36, 9.95));


// constant for sampling ages 2-19 - proportion

defineConstant("SampProp", c(0.003, 0.003, 0.008, 0.027, 0.041, 0.099, 0.109, 0.134, 0.152, 0.145, 0.125, 0.076, 0.034, 0.029, 0.008, 0.002, 0.002, 0.002));


//initialize genome section

initializeSex();

initializeMutationType("m1", 0.5, "f", 0.0);

initializeGenomicElementType("g1", m1, 1.0);


// make halibut-type genome assembly

ids = 1:24;

lengths = c(30597443, 11297635, 32675587, 28512877, 30753830,

28531750, 28364880, 23524515, 29170869, 27686247,

22214583, 21256114, 29083852, 24953334, 25621292,

27148517, 24735652, 16627714, 20912647, 20800843,

25055940, 26312018, 18431247, 21435488);

for (id in ids, length in lengths)

{

initializeChromosome(id, length, "A");

initializeMutationRate(0);

initializeRecombinationRate(1e-5);

initializeGenomicElement(g1);

}

// delete any files that will be appended to and write the headers on them that CKMRpop expects

writeFile("spip_pedigree.tsv", "year\tpop\tkid\tpa\tma\tlenquan\tlength\tSLiMID");

writeFile("spip_prekill_census.tsv", "year\tpop\tage\tmale\tfemale");

writeFile("spip_postkill_census.tsv", "year\tpop\tage\tmale\tfemale");

writeFile("spip_samples.tsv", "ID\tsyears_pre\tpop_pre\tsyears_post\tpop_post\tsyears_dur\tpop_dur\tlenquan\tlength\tSLiMID");

//File for everyone who is alive in a year and their length in that year :)

writeFile("slim_alives.tsv","year\tpop\tid\tlenquan\tlength\tsex\tage");

}



Ben Haller

unread,
Mar 13, 2026, 12:21:01 PMMar 13
to slim-d...@googlegroups.com
Hi Ellie!

Ah, so if you're writing a VCF file from Python you're using the tskit VCF-writing facilities, which have nothing to do with SLiM.  You'll want to check out the tskit documentation and ask those folks if you have questions about that.  It won't write out SLiM individual IDs to the VCF, I think, but you should be able to get the SLiM individual IDs from the tree sequence metadata and write those IDs out to a separate file, parallel to the VCF file, that provides the identity of the individuals written to the VCF through a one-to-one correspondence between the VCF and the list of IDs, assuming tskit's VCF writing allows you to specify the exact order in which the sampled individuals should be written.

Yes, you can call treeSeqRememberIndividuals() on the individuals that you sample to ensure that they are kept in the tree sequence (so that you have a SLiM ID in the metadata for all of them).  That should work fine.  The SLiM individual IDs will then be the "key" that allows you to cross-index all of your output information across different places.

I hope that clarifies things; happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Ellie Weise

unread,
May 20, 2026, 1:13:18 PM (11 days ago) May 20
to slim-discuss
Hi Ben,

Thanks again for the clarification! I think I have narrowed down where the disconnect is happening. I simplified my genetics to one nucleotide-based chromosome with a burn-in just to see what was going on better. But what I noticed is that for the outputs the individuals are tagged as p1:4 digit number for the individual in a given year. But that number resets for each tick, causing weird duplicates in my concatenated VCFs across sampling years and the numbers don't match up with the pedigreeID. I am wondering how to switch this number in the VCF to the pedigree ID so I can connect my outputs and get unique identifiers for all the individuals in my VCF across sampling years? Here is the sample code as written right now:

(BurnIn+195):(BurnIn+200) early() {

// get the total number of indivdiuals of age 2 to 19

num_2_to_19 = sum(tabulate(p1.individuals.age, maxbin = 50)[2:19]);

nsamp = rbinom(1,num_2_to_19,0.05);

YearSamp = NULL;

for(a in 1:17) {

TmpProp = SampProp[a];

// get number for the given age

ns = rbinom(1, nsamp, TmpProp);

if(ns > 0) {

// then sample them:

AgeSamp = p1.sampleIndividuals(ns, minAge = a+1, maxAge = a+1);

YearSamp = c(YearSamp,AgeSamp);

}

}

//YearSamp.setValue("extant_tags",0.0);

YearSamp.fitnessScaling = 0.0;


// output sample into a VCF format

YearSamp.outputIndividualsToVCF(paste0("SampleVCF_SLiMTest_",sim.cycle,".vcf"),append=F);

// then write these out:

for(s in YearSamp) {

s_name = paste0(s.sex, sim_year(sim.cycle) - s.age, "_0_", s.tag);

line = paste(s_name, "", "", sim_year(sim.cycle), "0", "", "",s.x,s.y, s.pedigreeID,sep = "\t");

writeFile("spip_samples.tsv", line, append = T);

}


}



And the top of the 'spip_samples.tsv' file that I have been saving the metadata in has the following structure (key covariates for us: sex, individual birth year, sample year, and length):

Screenshot 2026-05-20 141125.png


Thanks in advance again for your help, I really appreciate it!

Ben Haller

unread,
May 20, 2026, 2:04:21 PM (11 days ago) May 20
to slim-d...@googlegroups.com
Ah, yes.  You don't want to use the individual indexes to identify individuals; as you say, they are not unique.  You need to use pedigree IDs.  See the slimHaplosomePedigreeIDs list, as shown in section 28.2.3 for example.  There is a bit of commentary in that section.


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Ellie Weise

unread,
May 20, 2026, 2:24:48 PM (11 days ago) May 20
to slim-discuss
Hi Ben,

Ah okay, I have now found those and read through that section, that's great, thank you! I am wondering if I'm just not seeing it in the various individual output options, is there a way to label my individuals in the output with the PedigreeIDs instead of the default format (replacing the i000 value)? Sorry if that is a silly question, I'm just not seeing it in the help files/section of the manual at the moment, and changing those names in the output would save me a lot of headache converting/matching for the downstream analyses!

All the best,
Ellie

Ben Haller

unread,
May 21, 2026, 12:46:51 AM (11 days ago) May 21
to slim-d...@googlegroups.com
Hi Ellie,

No, that's not an option at present, sorry.  But it's pretty straightforward to work with slimHaplosomePedigreeIDs, I think?  If necessary, you could probably easily write a post-processing script that would edit the output VCF files to be in the format you want them to be in, I guess, and then go from there.

Cheers,
-B.

Peter Ralph

unread,
May 21, 2026, 12:53:53 PM (10 days ago) May 21
to slim-d...@googlegroups.com
Hey, folks - in python, you can (a) pull SLiM's pedigree IDs out of the metadata, and then (b) use the individual_names argument to write_vcf to make these the individual names in the VCF.  I've just added a note on this to the docs:
that should be visible later today; see the new code here: https://github.com/tskit-dev/pyslim/pull/424/changes


Hope that helps,
 Peter


From: 'Ben Haller' via slim-discuss <slim-d...@googlegroups.com>
Sent: Wednesday, May 20, 2026 9:46 PM
To: slim-d...@googlegroups.com <slim-d...@googlegroups.com>
Subject: Re: Connecting Individuals in pyslim recapitation to SLiM metadata
 
Hi Ellie, No, that's not an option at present, sorry. But it's pretty straightforward to work with slimHaplosomePedigreeIDs, I think? If necessary, you could probably easily write a post-processing script that would edit the output VCF files
ZjQcmQRYFpfptBannerStart
This message originated outside the UO email ecosystem.
Use caution with links and attachments. Learn more about this email warning tag.
 
ZjQcmQRYFpfptBannerEnd

Ben Haller

unread,
May 25, 2026, 3:09:29 AM (7 days ago) May 25
to slim-discuss
Thanks Peter, that's a nice solution.  I've opened an issue (https://github.com/MesserLab/SLiM/issues/631) that a solution within SLiM would be good too; VCF output is a bit different between SLiM and tskit, and of course someone might be using pedigree tracking without using tree-sequence recording, so it would be convenient for this to be supported intrinsically.  :->

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University

Reply all
Reply to author
Forward
0 new messages