SLiM subpopulation vcf output question

Scott Martin

unread,

Apr 24, 2019, 12:06:42 PM4/24/19

to slim-discuss

I'm hoping to use SLiM as a simulator inside of an ABC model, similar to the Sackman et al. 2019 paper. However, I'm having some trouble understanding parts of how the output are generated, and was hoping to get some clarification.

The focus of the model is trying to recover some population genetic parameters, so I'm simulating several subpopulations in a non-WF model with varying inputs. Eventually, what I would like to get out is genetic data that I can use to estimate individual pedigrees. So out of SLiM's various output options, I'm using outputVCFsample since it's very important to sample individuals and not genotypes. The problem is trying to compare and format these vcf files for later analysis. It's unlikely that all mutations are in every subpopulation (especially as the broader ABC model explores extreme parameter values), so the vcf files end up with different numbers of loci. I'm trying to come up with a way to address this problem while aiming to fully automate this cleanup process later.

Specifically, I'm wondering when using outputVCFsample, if a mutation is fixed as a substitution in a subpopulation but not the larger overall population, is it written to the vcf file or omitted? If it is not written, would the best option be to also write out all mutations with the outputMutations() option and match up missing loci with that to determine if they are fixed for the reference or mutation? Is it possible to force outputVCFsample to write out all loci across subpopulations?

I appreciate any help or suggestions,

-Scott

Ben Haller

unread,

Apr 24, 2019, 12:50:42 PM4/24/19

to slim-discuss

Hi Scott. Several issues here:

- It sounds like maybe you don't need a VCF file specifically; you write that "I'm using outputVCFsample since it's very important to sample individuals and not genotypes". You can use the lower-level Genome methods output() or outputMS() to output any vector of genomes you wish; you can easily arrange that the vector of genomes consists of pairs of genomes belonging to individuals, for example. If you draw a sample of individuals from a subpopulation with, say, "i_sample = sample(p1.individuals, 10, replace=F);", you can then get the genomes for those individuals with i_sample.genomes (arranged as the first and second genomes of the first individual, then the first and second genomes of the second individual, etc.; this is guaranteed by how Eidos handles vectorized property access, see the Eidos manual), and output them with i_sample.genomes.output(), for example. This sort of flexibility in the sample to be output is the reason for the existence of the low-level methods on Genome.

- It sounds like you want to output a sample from each of several subpopulations and you're concerned about matching up the information in those samples. It might be easier, perhaps, for you to output everything in one go. Again, the lower-level methods on Genome will allow you to do that, since you can call them on any vector of genomes. You could draw samples of individuals from each subpop with sample(), glom them together into one vector with c(), and then call sample_vector.genomes.output() – or outputMS(), or outputVCF() – on that one vector. If you are sampling a fixed number of individuals from each subpop, then you already know which subpop each genome comes from, in that glommed-together output; if not, you might want to output the subpopulation indices for the sampled genomes separately.

- To answer your original question directly: whether you are outputting in SLiM format, MS format, or VCF format, all segregating mutations will be included in the output even if they are fixed within the subpopulation or within the sample being output. SLiM only considers mutations "fixed" if they are fixed population-wide; it does not generally check for, or care about, fixation within smaller demographic units. (For MS output specifically, a "filterMonomorphic" flag is supported that allows mutations that are present in every sampled genome to be omitted from the output (see the doc); that option is not provided for the other output methods, only for MS output, and it is false by default.) So there should be no need to do the matching up with outputMutations() that you describe, I think. If mutations have fixed population-wide and have been replaced by Substitution objects, on the other hand, they will not be included in the output of any of these methods; outputFixedMutations() must be used to get them.

I hope this helps; let me know if you have further questions. Also, if you haven't noticed it already, section 13.8 in the current SLiM manual shows a recipe for using SLiM inside an ABC estimation procedure. I don't know very much about such matters, however; if you have suggested improvements for that recipe, they would be most welcome. Good luck!

Cheers,

-B.

Benjamin C. Haller

Messer Lab

Cornell University

Scott Martin

unread,

Apr 24, 2019, 4:58:49 PM4/24/19

to slim-discuss

Thanks for the quick response Ben! I hadn't thought about sampling individuals and then merging the samples together, that worked great. I'm hoping that I won't run into the issue of too few samples; if so I'll probably add some conditional if statements to sample the whole subpopulation instead, and then write out a 2nd text file with the subpop sizes.

I haven't looked at chapter 13 yet, but I'll let you know how the ABC part goes. Right now I'm working on calculating the summary statistics in R and the next step will be trying to set up the full ABC model with the calls to SLiM.

Thanks for putting all this work into SLiM, it's been the only simulator I've been able to find that has the level of flexibility I've been looking for and the documentation has been really nice.

Reply all

Reply to author

Forward