Neutral evolution with custom reference fasta and SNPs

235 views
Skip to first unread message

Tiina Sävilammi

unread,
Nov 30, 2023, 2:58:41 PM11/30/23
to slim-discuss
Hi, 

I have a question about running simulations of neutral evolution. We are aiming to use SLim 4 to simulate a short period of random drift in an ancestral population by combining examples 4.1 and 18.12. The SNPs for the ancestral population are from a single chromosome. N:s in the reference fasta are converted to A:s as SNPs don't overlap unknown positions. The data is imputed using Beagle software and only 1179 biallelic SNPs are kept.We are able to run the simulation according to example 4.1 and make two replicate of the same ancestral population which we run for six generations and get a Fst close to zero in the first generation:
// set up a simple neutral simulation(manual p. 95) and input fa/vcf (manual p. 519) initialize() { initializeMutationRate(1e-7); initializeMutationType("m1", 0.5, "f", 0.0); //neutral mutations initializeGenomicElementType("g1", m1, 1.0); initializeGenomicElement(g1, 0, 99999); //chr length initializeRecombinationRate(1e-8); } 1 early() { sim.setValue("FST", 0.0); sim.addSubpop("p1", 42); //create a subpopulation of 42 individuals sim.addSubpop("p2",42); p1.setSubpopulationSize(250); //change population size p2.setSubpopulationSize(250); log = community.createLogFile("sim_log_generic.txt", logInterval=1); log.addCycle(); log.addCustomColumn("FST", "calcFST(p1.genomes, p2.genomes);"); } 6 late() { sim.simulationFinished(); }

the output is then as I expect:
cycle,FST 1,0.001001 2,0.00184957 3,0.00224949 4,0.00267929 5,0.00368962 6,0.00286272
However when we try to plug in a reference fasta file and a vcf file to read the genotypes from, the initial Fst is way too high:
// set up a simple neutral simulation (manual p. 519) initialize() { initializeSLiMOptions(nucleotideBased=T); length = initializeAncestralNucleotides("chr1.fa"); defineConstant("L", length); initializeMutationTypeNuc("m1", 0.5, "f", 0.0); //neutral mutations m1.convertToSubstitution = F; initializeGenomicElementType("g1", m1, 1.0, mmJukesCantor(0.0)); //jukescantor for selection only from standing genetic variation initializeGenomicElement(g1, 0, L-1); //chr length from fasta initializeRecombinationRate(1e-8); } 1 early() { sim.addSubpop("p1", 42); //create a subpopulation of 42 individuals (same as in the vcf) sim.addSubpop("p2",42); p1.genomes.readFromVCF("anc_sub01_chr1_imputed_bi.vcf", m1); //read VCF file as mutations in the target genomes. p2.genomes.readFromVCF("anc_sub01_chr1_imputed_bi.vcf", m1); p1.setSubpopulationSize(250); //change population size p2.setSubpopulationSize(250); log = community.createLogFile("sim_log_fromvcf.txt", logInterval=1); log.addCycle(); log.addCustomColumn("FST", "calcFST(p1.genomes, p2.genomes);"); } 6 late() { sim.simulationFinished(); }
The output then is (with the initial FST is very high):
cycle,FST 1,0.300159 2,0.299871 3,0.298628 4,0.299064 5,0.300309 6,0.300746

Can you please tell me what I am doing wrong? I tried to find the solution from the manual but may have missed the information as I am a beginner with SLiM.

With best regards,

Tiina Sävilammi

Ben Haller

unread,
Nov 30, 2023, 3:26:16 PM11/30/23
to Tiina Sävilammi, slim-discuss
Hi Tiina!

It looks like you read both p1 and p2 from the same VCF file, but in two separate read operations.  You might think that this would mean they share all of their mutations (and be identical, initially), but in fact they share *none* of their mutations, because each VCF read creates new mutations to represent the mutations from that read operation.  See the discussion of IBS versus IBD in section 1.5, for example.  The FST is then based upon both populations having a bunch of SNPs that are all private, and thus quite high.

You would have a couple of choices to remedy this.  Spoiler: (3) is the best solution, almost certainly.  (1) Make a bigger VCF file containing all individuals, read it in, and then get the individuals into their final subpopulations using takeMigrants(); this is a fairly clean and simple solution, but requires the use of a nonWF model (which it sounds like you don't want/need).  (2) Read in the two VCFs as you do now, and then do a post-pass that scans over all of the mutations and uniques each duplicated pair down to a single canonical mutation shared by both subpopulations.  This would be a bit icky and slow, and I don't recommend it; too much trouble, too fiddly and bug-prone, and probably very poor performance for a large number of SNPs.  (3) The best solution for most situations: refactor things so you read in only one VCF file.  This is why readFromVCF() is a method on Genome, not Subpopulation – precisely to provide this level of flexibility, letting you read in VCF data that spans multiple subpopulations in whatever way you wish.  Basically, you would process your VCF file to have all individuals in a single file – p1 individuals first, p2 individuals second.  Then, do something like this:

c(p1,p2).genomes.readFromVCF(path);

That will read all the individuals at once, and so you won't end up with duplicate mutations.  I hope this helps; happy modeling!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Tiina Sävilammi wrote on 11/30/23 2:58 PM:
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/9796a122-63ba-4bb1-882e-5315a708beb0n%40googlegroups.com.

Danny Sadler

unread,
Feb 27, 2024, 9:59:06 AM2/27/24
to slim-discuss

Hi Benjamin,

 

I have been working with Tiina to use the slim software to calculate FST values of a simulated population after six generations, then using these simulated VCFs to compare statistics with our real populations after six generations to disentangle the effect of drift and selection. Following your previous advice is the below code correct?  We also wanted to calculate FST values for our real data so it is the same methodology, and to make sure it is identical to simulation methods. I wrote the code (listed below the simulation code) and found the FST values to be very high (0.3) whilst the simulated values are much smaller (0.0012-0.0013) and tenfold higher than other methods such as through VCFtools. Could you please advice why this may be the case?

 

Kind regards,

Danny

 

Code for simulating six generations:

// set up a simple neutral simulation (manual p. 519)

initialize() {

  initializeSLiMOptions(nucleotideBased=T);

  length = initializeAncestralNucleotides("chr2.fa");

  defineConstant("L", length);

  initializeMutationTypeNuc("m1", 0.5, "f", 0.0);  //neutral mutations

m1.convertToSubstitution = F;

  initializeGenomicElementType("g1", m1, 1.0, mmJukesCantor(0.0));  //jukescantor for selection only from standing genetic variation

  initializeGenomicElement(g1, 0, L-1);  //chr length from fasta

  initializeRecombinationRate(1e-8);

}

 

1 early() {

  sim.addSubpop("p1", 42); //create a subpopulation of 42 individuals

  p1.genomes.readFromVCF("chr2.vcf.vcf", m1);  //creates the SNPs from the VCF file as mutations in the target genomes. This example contains 1192 snps

  //p2.genomes.readFromVCF("chr2.vcf.vcf", m1);

  p1.setSubpopulationSize(500);

 

}

2 early() {

  sim.addSubpop("p2", 250); //create a subpopulation of 42 individuals

  sim.addSubpop("p3",250);  //create a subpopulation of 42 individuals

  p2.setMigrationRates(c(p1), c(0.5));

  p3.setMigrationRates(c(p1),c(0.5));

 

  log = community.createLogFile("sim_log_2.txt", logInterval=1);

  //for each interal do this

  log.addCycle();

  log.addCustomColumn("FST", "calcFST(p2.genomes, p3.genomes);");

  log.addCustomColumn("mutations","sim.mutations.size();");

 

6 late() {

   //sim.outputFull();

   // get a sample of 42 individuals, and then get the genomes

   s1 = sample(p2.individuals, 42).genomes;

   s2 = sample(p3.individuals, 42).genomes;

   sampleFST=calcFST(p3.genomes, p3.genomes);

    catn("sample FST "+ sampleFST);

   p1.genomes.outputVCF("simulated_p1.vcf",simplifyNucleotides=T);

   p2.genomes.outputVCF("simulated_p2.vcf",simplifyNucleotides=T);

   sim.simulationFinished();

}

 

Slim code 2 for FST values of real data six generations later (note it is written in a format for a batch script to run each chromosome separately and run the simulation 50 times):

 

#!/bin/bash

chr=$1

inputvcf=$2

 

# Path to the Slim executable

slim_executable="slim"

 

# Number of simulations to run

num_simulations=1

 

# Loop to run the simulation 50 times

for ((i = 1; i <= num_simulations; i++)); do

 

    slimfiletext='

    initialize() {

      initializeSLiMOptions(nucleotideBased=T);

      length = initializeAncestralNucleotides("'"${chr}"'.fa");

      defineConstant("L", length);

      initializeMutationTypeNuc("m1", 0.5, "f", 0.0);  //neutral mutations

      m1.convertToSubstitution = F;

      initializeGenomicElementType("g1", m1, 1.0, mmJukesCantor(0.0));  //jukescantor for selection only from standing genetic variation

      initializeGenomicElement(g1, 0, L-1);  //chr length from fasta

      initializeRecombinationRate(1e-8);

    }

    1 early() {

      sim.addSubpop("p1", 41); //create a subpopulation of 42 individuals

      sim.addSubpop("p4", 41);

      p1.genomes.readFromVCF("'"${inputvcf}"'", m1);  //creates the SNPs from the VCF file as mutations in the target genomes. This example contains 1192 snps

      p4.genomes.readFromVCF("RS1_chr1_inpute.vcf.vcf", m1);

      p1.setSubpopulationSize(500);

      p4.setSubpopulationSize(500);

    }

    1 early() {

      log = community.createLogFile("sim_log_'"${chr}"'_'"${i}"'.txt", logInterval=1);

      //for each internal do this

      log.addCycle();

      log.addCustomColumn("FST", "calcFST(p1.genomes, p4.genomes);");

      log.addCustomColumn("mutations","sim.mutations.size();");

      sampleFST=calcFST(p1.genomes, p4.genomes);

      catn("sample FST "+ sampleFST);

 

    }'

 

    echo "${slimfiletext}" > slimsim_"${chr}"_"${i}".txt

 

    # Run the Slim simulation with a unique output file and log file for each run

    $slim_executable slimsim_"${chr}"_"${i}".txt > output_"${chr}"_"${i}".log 2>&1

done

Ben Haller

unread,
Feb 27, 2024, 10:16:10 AM2/27/24
to Danny Sadler, slim-discuss
Hi Danny,

It looks like you are still doing the same thing that I explained earlier is problematic:


      p1.genomes.readFromVCF("'"${inputvcf}"'", m1);  //creates the SNPs from the VCF file as mutations in the target genomes. This example contains 1192 snps

      p4.genomes.readFromVCF("RS1_chr1_inpute.vcf.vcf", m1);

As I wrote last time:


> It looks like you read both p1 and p2 from the same VCF file, but in two separate read operations.  You might think that this would mean they share all of their mutations (and be identical, initially), but in fact they share *none* of their mutations, because each VCF read creates new mutations to represent the mutations from that read operation.  See the discussion of IBS versus IBD in section 1.5, for example.  The FST is then based upon both populations having a bunch of SNPs that are all private, and thus quite high.

So if there are any mutations shared between those two VCF files – and if I understand correctly what you are doing, there are – then your methodology here is not correct.  See that previous reply for my recommendations about how to fix the problem.  Apologies if I have misunderstood the situation; if so, please clarify.

As for your question "is the below code correct?" please note that I don't have the time to do code reviews for everybody using SLiM (which has hundreds of users).  I'm happy to answer specific questions about SLiM here, but learning to test and debug one's own code is an essential part of working with SLiM, or any other task involving programming.  I would suggest that you try hard to use the available tools in SLiM and SLiMgui to answer your question for yourself, and then of course feel free to come back here to ask a more specific question if you hit a roadblock.  For example, the problem last time around was that the mutations in your two populations were supposed to be the same mutation objects – shared mutations – but were in fact different, because you were creating them separately from each other.  It is straightforward to use the Eidos console to investigate whether that issue has now been resolved; if you're unfamiliar with using the Eidos console, the SLiM Workshop is a good place to start.

Good luck, and happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Danny Sadler wrote on 2/27/24 8:59 AM:

Tiina Sävilammi

unread,
Mar 14, 2024, 4:50:10 AM3/14/24
to slim-discuss
Hi Benjamin,

It is Tiina again as I had more detailed questions for you this time. First, thank you for your patience with our problem! I try to be more clear what we would like to achieve.

So, previously we simulated six generations of drift from an ancestral population and after you helped us, and found that the amount of drift is very low (Fst is something like 0.002 after six generations). We agree that this could indeed be the case.

Now, we would like to compare the estimated Fst under drift-only scenario to our real data set where we have two vcf:s of individuals from two F6 populations. Those two real F6 populations originated from that same ancestral population six generations ago and have been under selection. They do share alleles but are now two distinct populations. 

We thought that we would read the populations in and estimate the Fst in SLiM to make the Fst estimation identical to the drift-only scenario. How should we read in the genotypes of those two populations to avoid the previous problem (having private SNPs)? We can of course combine the two vcf:s but then do we need to somehow separate the individuals in SLiM? Or is this possible at all? If not, can you suggest what tool we should use to repeat the SLiM Fst estimation procedure? 

Here is the second part of the code which we cleaned up as it previously contained some misleading comments etc. The two selected F6 populations in the code are p1 and p4. Here is a new version: 

#!/bin/bash
chr=$1  #here we read in one chromosome name
inputvcf=$2  # genotypes of individuals from F6 line 1, which are descendants of a single ancestral line
inputvcf2=$3 # genotypes of individuals from F6 line 2, which are descendants of a single ancestral line# Path to the Slim executable

slim_executable="slim"
# Number of simulations to run
num_simulations=50

# Loop to run the simulation 50 times
for ((i = 1; i <= num_simulations; i++)); do
   #here we generate a simulation text for one of the 50 runs
   slimfiletext='
   initialize() {
     initializeSLiMOptions(nucleotideBased=T);
     L = initializeAncestralNucleotides("'"${chr}"'.fa");
     initializeMutationTypeNuc("m1", 0.5, "f", 0.0); //we initialize neutral mutations from standing genetic variation

     m1.convertToSubstitution = F;
     initializeGenomicElementType("g1", m1, 1.0, mmJukesCantor(0.0)); //jukescantor for selection only from standing genetic variation
     initializeGenomicElement(g1, 0, L-1); //we initialize the genome size as the length of the fasta
     initializeRecombinationRate(1e-8);
   }
   1 early() {
     sim.addSubpop("p1", 41); //we create a subpopulation of 41 individuals
     sim.addSubpop("p4", 41); //we create another subpopulation of 41 individuals
     p1.genomes.readFromVCF("'"${inputvcf}"'", m1); //PROBLEMATIC: we read a vcf containing the first population
     p4.genomes.readFromVCF("'"${inputvcf2}"'", m1);  //PROBLEMATIC: we read a vcf containing the second population;
     p1.setSubpopulationSize(500);  //we change the first population size from 41 (sample) from 500 (total population we had)
     p4.setSubpopulationSize(500);  //we change the second population size from 41 (sample) from 500 (total population we had)
     log = community.createLogFile("sim_log_'"${chr}"'_'"${i}"'.txt", logInterval=1);  //create a log file for this round of the simulation
     log.addCycle();  //print the simulated generation
     log.addCustomColumn("FST", "calcFST(p1.genomes, p4.genomes);");  //print Fst for that generation
     log.addCustomColumn("mutations","sim.mutations.size();");  //print the number of mutations
     sampleFST=calcFST(p1.genomes, p4.genomes);
     catn("sample FST "+ sampleFST);   //print the final Fst
   }' echo "${slimfiletext}" > slimsim_"${chr}"_"${i}".txt   #now we print the simulation text to file
   # and the nwe run the Slim simulation with a unique output file and log file for each chromosome and run

   $slim_executable slimsim_"${chr}"_"${i}".txt > output_"${chr}"_"${i}".log 2>&1
done

All the best and thank you so much, 

Tiina

Ben Haller

unread,
Mar 14, 2024, 9:33:37 AM3/14/24
to Tiina Sävilammi, slim-discuss
Hi Tiina!

Yes, combine the individuals into one VCF, and read that VCF into a nonWF model (it has to be nonWF), as a single subpopulation.  Create the other subpopulation(s) you need with addSubpop(), with an initial size of 0.  Then use takeMigrants() to move individuals to the subpopulations they belong in.  Do all of that in the same event, so that nothing else happens in between; reading the VCF and moving the individuals to their final destinations all happens in one script block.  That ought to work fine; let me know if you hit any snags.  :->

I didn't look at the script you sent, since it seemed like I didn't need to in order to answer your question.  Let me know if I missed something.  :->  Happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Tiina Sävilammi wrote on 3/14/24 4:50 AM:
Reply all
Reply to author
Forward
Message has been deleted
0 new messages