Adjusting VCF Sample Output

Ellie Weise

Aug 28, 2025, 1:49:52 PM

to slim-discuss

Hi everyone,

I am working on a nonWF sexual reproduction simulation scenario in SLiM 5.0, and I am currently trying to clean up the sampling section of the model. I am modeling a long-lived species (lives up to 50 years) currently for 100 years, and I am sampling individuals between the age of 2 and 19 (with proportions spread in a normal distribution defined by the constant SampProp listed below) in years 90 to 95 of the simulation. I am sampling a total of 10% of the population in each year.

I pasted the code I'm currently using to sample, and I have two questions that I'm struggling with still:

1. How can I get the proportional sample for the age classes like I have below, but only get one .vcf at the end of the year, or even better one .vcf for each simulation run? The way the code is written right now, it appends new VCF output to the same file for each age and sample year, and the result is a pretty unwieldy file.

2. The total population (K) is currently 10000, and this sampling is very slow. Are there ways to speed up this part of the code? I'm estimating a marine population that is much larger than 10000 so I would like to increase K but I need to optimize first.

Let me know if any more clarification is needed and I'll happily provide what's needed :) thank you all so much in advance for your help!

All the best,

Ellie

// constant for sampling ages 2-19 - proportion
defineConstant("SampProp", c(0.003, 0.003, 0.008, 0.027, 0.041, 0.099, 0.109, 0.134, 0.152, 0.145, 0.125, 0.076, 0.034, 0.029, 0.008, 0.002, 0.002, 0.002));

// sample 10% of possible sample population during years 90 to 95, inclusive
90:95 late() {
    // get the total number of individuals of age 2 to 19
    num_2_to_19 = sum(tabulate(p1.individuals.age, maxbin = 50)[2:19]);
    nsamp = rbinom(1, num_2_to_19, 0.1);

    for (a in 1:17) {
        TmpProp = SampProp[a];

        // get number for the given age
        ns = rbinom(1, nsamp, TmpProp);

        if (ns > 0) {
            // then sample them:
            samps = p1.sampleIndividuals(ns, minAge = a+1, maxAge = a+1);

            // output sample into a VCF format
            samps.outputIndividuals();
            samps.outputIndividualsToVCF("SampleVCF_SLiMTest.vcf", append=T);

            // then write these out:
            for (s in samps) {
                s_name = paste0(s.sex, sim.cycle - s.age, "_0_", s.tag);
                line = paste(s_name, "", "", sim.cycle, "0", "", "", sep = "\t");
                writeFile("spip_samples.tsv", line, append = T);
            }
        }
    }
}

Ben Haller

Aug 28, 2025, 4:06:43 PM

to Ellie Weise, slim-discuss

Hi Ellie!

The outputIndividuals() and outputIndividualsToVCF() methods can be called on any vector of individuals that you put together, so you can build a vector like:

allSamples = NULL;
for (a in 1:17) {
... sample individuals of age a, as you do now ...
allSamples = c(allSamples, samplesOfAgeA);
}
allSamples.outputIndividuals("/path/to/file.txt");
allSamples.outputIndividualsToVCF("/path/to/file.vcf");

In other words, just use c() to concatenate together all the sampled individuals and then make a single call to produce the output you want. Write each chunk of output to a file, with a filename that indicates what it needs to indicate (like the tick when the output was produced, maybe), rather than just writing it all to the console (which is what happens when you don't provide a filename), and then you'll have standalone files that should be easier to process.
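Filled in against the sampling event from the question, the concatenation approach might look roughly like the sketch below. It is only a sketch under a few assumptions: the file name pattern is hypothetical, and the loop indexes SampProp from 0 (Eidos vectors are zero-based) so that its 18 entries map onto ages 2 through 19, which is one reading of the intent rather than something confirmed in this thread.

```eidos
// Sketch: one standalone VCF per sampling year, written with a single call.
90:95 late() {
    ages = 2:19;    // assumption: SampProp[0] is the proportion for age 2
    counts = tabulate(p1.individuals.age, maxbin=50)[ages];
    nsamp = rbinom(1, sum(counts), 0.1);    // total sample size this year

    allSamples = NULL;
    for (i in seqAlong(ages)) {
        ns = rbinom(1, nsamp, SampProp[i]);
        if (ns > 0) {
            samps = p1.sampleIndividuals(ns, minAge=ages[i], maxAge=ages[i]);
            allSamples = c(allSamples, samps);
        }
    }

    // skip output entirely if nothing was sampled this year
    if (size(allSamples) > 0) {
        // hypothetical file name; the cycle number keeps each year separate
        vcfPath = "SampleVCF_" + sim.cycle + ".vcf";
        allSamples.outputIndividualsToVCF(vcfPath);
    }
}
```

Because each year gets its own file and append is not used, every file carries exactly one VCF header and can be read on its own.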

As for performance, I wouldn't expect this to be particularly slow if you produce just one file (unless maybe you're modeling full genomes with tons of SNPs, in which case the VCF files might be huge and so might take a long time to generate); maybe it was slow because you were producing 19 files instead of 1? Anyhow, you can profile the model in SLiMgui to see where it is slow, and then ask for more help if it isn't clear what the problem is.

I would say that it will be much faster to call writeFile() once with a vector of strings, one per line, that you have assembled, than to call writeFile() appending one line at a time to a large file; that might be a big performance problem there. I'd also note that you could collect the data you want to log into a DataFrame object and then serialize that to CSV in a single call; that will be the fastest output option, probably, and much faster than using paste() to add all the commas and so forth; note that DataFrame and serialize() are documented in the Eidos manual. But anyhow, profiling is the first step, to see where the performance problems are; don't waste time optimizing things that aren't a problem anyway. :-> The SLiM manual discusses profiling in detail, and it is also covered in the SLiM workshop. Good luck!
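As a rough illustration of the two output options described above, the sketch below assumes the sampled individuals have already been gathered into one vector (allSamples in the earlier snippet). The function name writeSampleTable, the column names, and the second file name are hypothetical. Note that paste() in Eidos collapses all of its arguments into a single string, so the per-individual names are built with the vectorized + operator instead.

```eidos
// Sketch: write the sample table once per call instead of once per individual.
function (void)writeSampleTable(o<Individual> inds)
{
    // one name per individual: sex, birth cycle, "_0_", tag
    birth = sim.cycle - inds.age;
    names = inds.sex + birth + "_0_" + inds.tag;

    // option 1: one writeFile() call with one string per line,
    // using the same tab-separated layout as the original loop
    lines = names + "\t\t\t" + sim.cycle + "\t0\t\t";
    writeFile("spip_samples.tsv", lines, append=T);

    // option 2: collect columns in a DataFrame and serialize it in one call;
    // written to a per-cycle file because the output includes a header row
    df = DataFrame("name", names, "cycle", rep(sim.cycle, size(names)));
    writeFile("spip_samples_" + sim.cycle + ".tsv", df.serialize("tsv"));
}
```

In practice only one of the two options would be kept, and the function would be called once per sampling year, for example as writeSampleTable(allSamples); after checking that the vector is not empty.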

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University

--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/d54bbb22-1d85-460d-91ec-5e1238df5807n%40googlegroups.com.
