Simulate a large European population / issue with outputVCFSample?

46 views
Skip to first unread message

Jon Lerga Jaso

unread,
Sep 21, 2021, 2:07:35 PM9/21/21
to slim-discuss
Hi all,

First of all, thank you so much for the fantastic work done with SLiM and the documentation!
Now, I just started to use SLiM and maybe this is a naive question since I am still a beginner, but I want to simulate a population of ~ 50k (or more) individuals with the same variants, similar LD patterns and MAF as in the 1KGP EUR population.
For that, what I did was to use a nucleotide-based model and used the HapMap recombination map and a filtered VCF from the 1KGP that include European samples and variants filtered by MAF>0.01. I simulated a neutral evolutionary scenario over several generations and then I produce an VCF as output. My idea was to obtain a proxy population with similar allele frequencies and LD patterns (not exactly the same, of course,  but comparable).
The main issue is that this output VCF has a lot of related people, since I guess the small pool of 500 individuals make a limited source of haplotypes, and therefore the final samples are related, whereas what I need is a large set of unrelated individuals.

What I am asking may be impossible to obtain, but I don't know if there is a way to do that, like increasing the recombination rate at the expense of a lower LD, or including some mutation rate for the 1KGP variants.

Secondly, I may not understand the function outputVCFSample, but if my population has a sample size of 3k individuals, how is it possible for me to extract 10k individuals in the last generation? I though that setSubpopulationSize fixes the number of individuals in the offspring and therefore the final dataset should have at most 3k individuals, but I can extract a higher number of individuals.

Thank you in advance!
Ion Lerga

Script:
initialize() {
    initializeSLiMOptions(nucleotideBased=T);
    length = initializeAncestralNucleotides("PATH/fasta_hg19/hs37d5_chr4.fa");
    defineConstant("L", length);
    initializeMutationTypeNuc("m1", 0.5, "f", 0.0);
    initializeMutationTypeNuc("m2", 0.5, "f", 0.0);
    m2.convertToSubstitution = F;
    initializeGenomicElementType("g1", m1, 1.0, mmJukesCantor(0.0)); // mmJukesCantor(0.0))
    initializeGenomicElement(g1, 0, L-1);
    
    // HapMap recombination map
    lines = readFile("PATH/haps_1kgp/genetic_map_chr4_combined_b37.txt");
    lines = lines[1:(size(lines)-1)];
    rates = NULL;
    ends = NULL;
    for (line in lines)  { 
        components = strsplit(line, " ");
        ends = c(ends, asInteger(components[0]));
        rates = c(rates, asFloat(components[1]));
    }
    ends = c(ends[1:(size(ends)-1)] - 2, length-1);
    rates = rates * 1e-8;
    initializeRecombinationRate(rates, ends);   
    
}

1 late() {
    sim.addSubpop("p1", 503);
    p1.genomes.readFromVCF("PATH/chr4.1kgp.eur.vcf", m1);
    p1.setSubpopulationSize(1000);
}

2:50 late() {

catn(sim.generation);

}

// 2:50 {
//     newSize = asInteger(p1.individualCount + 500);
//     p1.setSubpopulationSize(newSize);   
// }

50 late() { p1.outputVCFSample(500, filePath="PATH/Results/SLiM.simpop.4.vcf"); }


Ben Haller

unread,
Sep 21, 2021, 2:28:35 PM9/21/21
to slim-discuss
Hi Ion,

I think you've correctly diagnosed the problem you're having, yes.  If you start with a small population and run for a short period of time, everybody is going to be closely related.  If you want a larger popular with less relatedness, you'll need to simulate accordingly – start with a bigger VCF file, simulate for longer to allow more divergence, etc.  If that makes the simulation too large to be practical for forward simulation, you might look into the "recapitation" technique that allows you to construct a coalescent burn-in for a SLiM forward simulation; it can cut the simulation times for large models by orders of magnitude.  However, if you need to start from VCF nucleotide data, etc., it may not be suitable for your purposes; it will only give you a neutral coalescent burn-in, it won't supply you with patterns of genetic diversity that match those in a specific empirical population.  If recapitation is not helpful, then model rescaling might bring the problem within reach of forward simulation; you seem to be speculating about that.  Section 5.5 of the SLiM manual has a discussion of rescaling that you might find helpful.  If none of this is satisfactory, then it may be that forward simulation isn't really the tool you want at all; perhaps you just need to define exactly what kind of genetic diversity you want to generate, and then write some sort of statistical model that generates that diversity by sampling directly from the VCF data in some way?  I'm not really sure; this is way outside of my area of expertise.  :->

The doc for outputVCFSample() should clarify your question on that.  It states "The sample may be done either with or without replacement, as specified by replace; the default is to sample with replacement."  With replacement, you take take a larger sample from a population than the size of the population itself.  If you tell it to sample without replacement, then that would produce an error.  With replacement is the default because it has better statistical properties (less bias), as I understand it, but someone else could speak to that better than I can.

I hope this helps.  If you haven't discovered it already, the online SLiM workshop can be very helpful to beginners for getting up to speed; it provides a good introduction to tree-sequence recording and recapitation, for example.  :->  Good luck!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Zuxi Cui

unread,
Sep 21, 2021, 4:39:35 PM9/21/21
to Ben Haller, slim-discuss
Jon,

I have been doing something similar now and it is possible to simulate up to 10K independent samples with SliM from my experience.
First of all, you will need a large amount of RAM. I used 96 GB.
Second, the population size (p1.SetSubpopulation Size) needs to be larger than the output.
Third, I used 1e-7 as the mutation rate (mmJukesCantor()). Otherwise, after all generations, all you will have is big genetic drifts.
Fourth, the kinship estimation from a single chromosome is not reliable. What you have observed is quite expected. 
Although the gene sharing is rather independent across chromosomes, on the same chromosome the gene sharing is not independent at all.
You are more likely to detect false relatives by using a single chromosome for IBD.

I'm building a github page as a guideline for whole-genome simulations with SliM: https://github.com/zxc307/GWAS_simulation_handbook
If you have any related experiences to share, let me know.

Thanks,
Terry


--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/df33a9b1-bfbe-4944-bafc-ef51caf758f3n%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages