Recapitation vs. ancestral frequencies from vcf

48 views
Skip to first unread message

Jared Grummer

unread,
Apr 4, 2024, 2:25:14 PMApr 4
to slim-discuss
Hi Ben,

We are trying to make our simulated data as similar as possible to our empirical data, so we are hoping to initiate our two subpopulations with the variation we see in our empirical data. I believe we can do this through the readFromVCF() option, and I believe this requires the initializeSLiMOptions(nucleotideBased=T) option. However, I'm using the TreeSeq model because we're also trying to get information on ancestral chromosome tracts in admixed individuals (genome size ~2GB); our ancestral variation therefore is coming randomly from recapitating with the SLiM .trees output.

So my question is, is it possible to sprinkle in ancestral variation at specific sites, while using the TreeSeq model? I don't think so. I've also tried to see if we could add in the ancestral variation at particular sites in msprime, but I can't find anything about such an option.

Thanks!
Jared

Ben Haller

unread,
Apr 4, 2024, 3:11:34 PMApr 4
to Jared Grummer, slim-discuss
Hi Jared.  Well, if you recapitate to get your ancestry here, it is likely to conflict with the patterns of variation in your empirical data.  For example, a mutation in your empirical dataset might occur in, say, 100 particular genomes in the population, but the recapitation you do might show those 100 genomes as having no recent shared ancestry – so for that mutation to be present in all 100 genomes, it would have had to arise 100 separate times (or be lost however many separate times), in what would probably be a huge violation of parsimony.

It seems to me that what you might want to do is *infer* the ancestry based upon your empirical data, rather than just recapitating to get a random ancestry.  You could use tsinfer, for example, to get an inferred tree sequence that contains your empirically observed mutations in a way that is parsimonious with the inferred ancestry.  You could then load that inferred tree sequence in to SLiM (probably after also inferring dates with tsdate and adding the necessary SLiM metadata with pyslim).  I'm not familiar with tsinfer and tsdate myself, so I can't give you any more specific advice.  Perhaps Peter Ralph, or someone else on this list, might have more to say?

If you don't want to do that, then I'd say you might as well just load the VCF at the start of your SLiM model.  The mutations in the VCF will then get added to the specified genomes, originating in the first tick of the forward simulation.  When you recapitate later, the deep ancestry will be added backwards from that initial state.  If you're not going to infer the ancestry, that's really about the best you're going to do anyway; there will probably not be a single point in the recapitated ancestry where you could add a given mutation to have it end up in the correct genomes anyway (barring some lucky coincidence).  You certainly can add your own mutations in on the Python side in whatever way you want, but it requires modifying the tables that underly the tree sequence, so it's a bit complicated.  There should be examples of that sort of thing in the tskit/msprime documentation, I'd think.  But again, I think the structure of the tree sequence from recapitation won't match the patterns in your VCF anyway, so I'm not sure that approach really makes any sense at all.  I think tsinfer is the way to go.

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Jared Grummer wrote on 4/4/24 2:25 PM:
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/6a2c02e0-0bba-4b59-a385-29c875014b67n%40googlegroups.com.

Jared Grummer

unread,
Apr 4, 2024, 4:44:52 PMApr 4
to slim-discuss
Hi Ben,

Thanks for the fast response! Maybe I wasn't clear, but instead of recapitating, I'm trying to load in ancestral variation in the first generation based on allele frequencies we see at different polymorphic sites across the genome for the two parental sub/populations. So, there would be no recapitating involved, just initiating the populations with individuals/variation from the vcf (p1.genomes.readFromVCF). Then in Python, we would determine the p1 vs p2 ancestry of each individual's chromosomes at each site from the vcf. If that makes sense, is it possible to add in the variation from a .vcf while using a TreeSeq model?

Also, I am getting errors while reading in my .vcf. I was hoping the SLiM website would have a .vcf that is referenced in the manual so I could compare and try to figure out (like chr22_filtered.recode.vcf), but I couldn't find one. My current error seems to imply that you cannot upload a .vcf with missing data:

(EidosInterpreter::NonnegativeIntegerForString): '.' could not be represented as an integer (decimal or negative exponent).

My .vcf has missing data as "./.". It makes sense, but my .vcf cannot contain missing data?

Thanks!

Jared

Ben Haller

unread,
Apr 4, 2024, 5:15:26 PMApr 4
to Jared Grummer, slim-discuss
Hi Jared,

OK, I guess I misunderstood.  Well, if you want to start your tree-seq model's run in SLiM with a VCF, you can simply load the VCF; that ought to work fine.  There just won't be any ancestry prior to the first forward-simulated generation; and if you recapitate later, the recapitated history won't make any real sense with respect to that initial VCF state – mutations will just appear out of nowhere, in the first forward-simulated generation, in patterns that don't make any sense with the ancestry tree.  If that's fine with you, then just load the VCF as usual, I think.

Yes, missing data in VCF is not supported; you need to decide whether the mutation is present or not, in each genome, because SLiM needs to know that.

The chr22_filtered.recode.vcf file is 68 MB, so I don't think I can post it here, but if you want it I can get it to you somehow.  :->  There's nothing special about it, though; it's just a VCF file, generated from an empirical dataset using standard open-source tools as described in the recipe where it is referenced.  Email me off-list if you want me to post it somewhere for you to download.  If you want to do things like specifying the mutation type and selection coefficients for mutations loaded from VCF, see sections 27.2.3 and 27.2.4 in the manual for all the gory details.


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Jared Grummer wrote on 4/4/24 4:44 PM:

Jared Grummer

unread,
Apr 9, 2024, 11:54:23 AMApr 9
to slim-discuss
Hi Ben,

Sorry for my delay, currently working my way back across country after watching the solar eclipse!

It took some tweaking (don't need to use a nucleotide-based model and ancestral sequence), so I think I have the treeseq model working with the mutations added to the coalescent trees! I'm not doing any recapitating, just importing the ancestral variation via a .vcf of ancestral variation from our empirical data (with no missing data!).

I'm not in desperate need of the chr22_filtered.recode.vcf file, just thought it might be nice to have (maybe as a gzipped file?) on the website with the other distributed SLiM materials for working through the recipes, though I don't know in how many tutorials and such it would help.

Thanks again!
Jared

Jared Grummer

unread,
Apr 16, 2024, 9:09:52 PMApr 16
to slim-discuss
Hi Ben,

I just want to clarify that I'm reading in individuals/mutations correctly from my empirical .vcf files. I read in this thread (https://groups.google.com/g/slim-discuss/c/RIHaq2WvuUc/m/l64u9CwrAwAJ) about the three ways to read individuals from two separate pops into SLiM so they share the same mutations. I don't think I need the nonWF model and takeMigrants(), so I'm trying to use the third option you suggested - reading in a single .vcf file. I made a .vcf file where the first n_rbt individuals are in p1 and the last n_wct individuals are in p2, and I'm just trying to confirm that the code below parses those individuals into their respective populations correctly, then bumps up the pop. sizes and gets migration going. Does this look good?

1 late() {

//make two pops, with ongoing gene flow until the end of the simulation

sim.addSubpop("p1", n_rbt); // rainbow trout population

sim.addSubpop("p2", n_wct); // westslope population

c(p1,p2).genomes.readFromVCF("/RBT_slim_input.vcf", m1); // Read in a .vcf that was made in R and has RBT individuals first and WCT individuals second

p1.setSubpopulationSize(1000); // bump up RBT pop. size

p2.setSubpopulationSize(1000); // bump up WCT pop. size

p1.setMigrationRates(p2, 0.05); // migration rate from WCT into RBT

p2.setMigrationRates(p1, 0.01); // migration rate from RBT into WCT


Thanks a lot!

Jared

Ben Haller

unread,
Apr 17, 2024, 2:01:56 PMApr 17
to Jared Grummer, slim-discuss
Hi Jared!

I think that looks like it ought to work fine, if the VCF is set up correctly.  Nothing like verifying it in the Eidos terminal or some such, though.  Or perhaps writing out separate VCFs for p1 and p2 right after you read in, to check that they look like the original VCFs they ought to look like?  As people sometimes say, "trust but verify" – it might look fine to you, and to me, but that doesn't necessarily mean it's actually fine.  Not to mention that SLiM could have a bug, etc.  It's always a good idea – essential, I'd say! – to validate that your model is doing what you intend it to be doing, every step along the way, to the extent reasonable/possible.

Note that the scaling and the migration will happen in the same offspring generation operation, for the code as written.  So, for example, 5% of the 1000 RBT individuals generated by the first reproduction event will be migrant juveniles from WCT.  If that's what you want, then you're good; but if you wanted to expand the populations first, up to 1000, and *then* start migration, you'd want to start the migration in the following tick, then.  Not sure how much difference that would make really, but perhaps having one round of within-species reproduction to scramble up the loaded genetic diversity a bit would be better?  Maybe even multiple generations of that prior to mixing could be a good idea?  I don't know, just thinking out loud.  :->


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Jared Grummer wrote on 4/16/24 9:09 PM:

Jared Grummer

unread,
Apr 17, 2024, 3:11:46 PMApr 17
to slim-discuss
Hi Ben,

Okay, thanks for the reply! Writing out vcfs immediately after reading in sounds like a good idea. I was trying to think of how to verify the correct reading of the VCFs in the Eidos terminal, but wasn't sure how to do that. I'll do that and verify! Fingers crossed...

And thanks for the reminder about the offspring generation and migration. I don't think it really matters in our model, due to the way we've generated the allele frequencies in the input VCFs. But I'll keep it in mind for something we may need to modify in the future!

Thanks again,
Jared
Reply all
Reply to author
Forward
0 new messages